A Unified Approach for Multi-step Temporal-Difference Learning with Eligibility Traces in Reinforcement Learning


Long Yang, Minhao Shi, Qian Zheng, Wenjia Meng, and Gang Pan
College of Computer Science and Technology, Zhejiang University, China

Abstract

Recently, a new multi-step temporal-difference learning algorithm, Q(σ), was shown to unify n-step Tree-Backup (when σ = 0) and n-step Sarsa (when σ = 1) by introducing a sampling parameter σ. However, like other multi-step temporal-difference learning algorithms, Q(σ) requires large memory consumption and computation time. The eligibility trace is an important mechanism that transforms off-line updates into efficient on-line ones which consume less memory and computation time. In this paper, we combine the original Q(σ) with eligibility traces and propose a new algorithm, called Q(σ, λ), where λ is the trace-decay parameter. This new algorithm unifies Sarsa(λ) (when σ = 1) and Q^π(λ) (when σ = 0). Furthermore, we give an upper error bound for the Q(σ, λ) policy evaluation algorithm, and we prove that the Q(σ, λ) control algorithm converges to the optimal value function exponentially fast. We also compare it empirically with conventional temporal-difference learning methods. Results show that, with an intermediate value of σ, Q(σ, λ) creates a mixture of the existing algorithms that learns the optimal value significantly faster than either extreme end (σ = 0 or σ = 1).

1 Introduction

In reinforcement learning (RL), experiences are sequences of states, actions and rewards generated by an agent interacting with its environment. The agent's goal is to learn from these experiences and seek an optimal policy in a delayed-reward decision system. Two fundamental learning mechanisms have been studied in RL. One is temporal-difference (TD) learning, a combination of Monte Carlo methods and dynamic programming [Sutton, 1988]. The other is the eligibility trace [Sutton, 1984; Watkins, 1989], a short-term memory process defined as a function of states. TD(λ) combines TD learning with eligibility traces and unifies one-step learning (λ = 0) and Monte Carlo methods (λ = 1) through the trace-decay parameter λ [Sutton, 1988]. Results show that the unified algorithm TD(λ) performs best at an intermediate value of λ rather than at either extreme.

Recently, multi-step Q(σ) [Sutton and Barto, 2017] unified n-step Sarsa (σ = 1, full sampling) and n-step Tree-Backup (σ = 0, pure expectation). For an intermediate value 0 < σ < 1, Q(σ) creates a mixture of the full-sampling and pure-expectation approaches and can perform better than either extreme case σ = 0 or σ = 1 [De Asis et al., 2018]. The results in [De Asis et al., 2018] imply a fundamental trade-off in reinforcement learning: should one estimate the value function with a pure-expectation (σ = 0) algorithm or a full-sampling (σ = 1) algorithm? Although the pure-expectation approach has lower variance, it requires more complex and larger computation [Van Seijen et al., 2009]. The full-sampling algorithm, on the other hand, needs less computation time but may have worse asymptotic performance [De Asis et al., 2018]. Multi-step Q(σ) [Sutton and Barto, 2017] is the first attempt to combine the pure-expectation and full-sampling algorithms; however, multi-step temporal-difference learning is too expensive in time and space complexity during training. The eligibility trace is an effective way to address this complexity problem. In this paper, we combine the Q(σ) algorithm with eligibility traces and create a new algorithm, called Q(σ, λ).
The proposed Q(σ, λ) unifies the Sarsa(λ) algorithm [Rummery and Niranjan, 1994] and the Q^π(λ) algorithm [Harutyunyan et al., 2016]. As σ varies from 0 to 1, Q(σ, λ) changes continuously from Sarsa(λ) (σ = 1) to Q^π(λ) (σ = 0). We also focus on the trade-off between pure expectation and full sampling in the control task; our experiments show that an intermediate value of σ achieves better performance than either extreme case. Our contributions are summarised as follows:

We define a mixed-sampling operator, through which we derive the proposed Q(σ, λ) policy evaluation and control algorithms. The new algorithm Q(σ, λ) unifies Sarsa(λ) and Q^π(λ).

For the Q(σ, λ) policy evaluation algorithm, we give an upper error bound.

For the control problem, we prove that both the off-line and on-line Q(σ, λ) algorithms converge to the optimal value function.

2 Framework and Notation

The standard episodic reinforcement learning framework [Sutton and Barto, 2017] is often formalized as a Markov decision process (MDP).

Such a framework considers a 5-tuple $M = (S, A, P, R, \gamma)$, where $S$ is the set of all states, $A$ is the set of all actions, $P^a_{ss'}$ (an element of $P$) is the probability of transitioning from state $s$ to state $s'$ under action $a$ ($a \in A$, $s, s' \in S$), $R^a_{ss'}$ (an element of $R$) is the expected reward for that transition, and $\gamma \in (0, 1)$ is the discount factor. A policy $\pi$ is a probability distribution on $S \times A$, and a stationary policy is one that does not change over time. In this paper, we denote by $\{S_t, A_t, R_t\}_{t\ge 0}$ the state-action-reward trajectory of one episode. The state-action value function $q$ maps $S \times A$ to $\mathbb{R}$; for a given policy $\pi$, the corresponding state-action value is
$$q^\pi(s,a) = \mathbb{E}_\pi\Big[\sum_{t\ge 0}\gamma^t R_t \,\Big|\, S_0 = s, A_0 = a\Big].$$
The optimal state-action value is defined as $q^*(s,a) = \max_\pi q^\pi(s,a)$. The Bellman operator $T_\pi$ and the Bellman optimality operator $T$ are
$$T_\pi q = R + \gamma P^\pi q, \tag{1}$$
$$T q = \max_\pi\big(R + \gamma P^\pi q\big), \tag{2}$$
where the entries of $P^\pi$ and $R$ are
$$(P^\pi q)(s,a) = \sum_{s'\in S} P^a_{ss'} \sum_{a'\in A}\pi(s',a')\,q(s',a'), \qquad R(s,a) = \sum_{s'\in S} P^a_{ss'} R^a_{ss'}.$$
The value functions $q^\pi$ and $q^*$ satisfy the Bellman equation and the Bellman optimality equation, respectively: $T_\pi q^\pi = q^\pi$ and $T q^* = q^*$. Both $T_\pi$ and $T$ are $\gamma$-contractions in the sup-norm, that is, $\|\hat T Q_1 - \hat T Q_2\|_\infty \le \gamma\|Q_1 - Q_2\|_\infty$ for any $Q_1, Q_2 \in \mathbb{R}^{S\times A}$, where $\hat T \in \{T_\pi, T\}$. Since the fixed point of a contraction operator is unique, value iteration converges: $T_\pi^{\,n} Q \to q^\pi$ and $T^n Q \to q^*$ as $n \to \infty$, for any initial $Q$.

2.1 One-step TD Learning Algorithms

Unfortunately, neither system (1) nor (2) can be solved directly, because $P$ and $R$ are usually unknown or the dimension of $S$ is too large in practice. A setting in which the model is unavailable is called model-free. TD learning [Sutton, 1984; Sutton, 1988] is one of the most significant algorithms in model-free reinforcement learning; the idea of bootstrapping is critical to it: current estimates of the value function are used as targets during the learning process. In reinforcement learning, the target policy $\pi$ is the policy being learned about, and the policy $\mu$ used to generate the trajectory $\{S_t, A_t, R_t\}_{t\ge 0}$ is called the behavior policy. If $\pi = \mu$ the learning is on-policy; otherwise it is off-policy.

Sarsa: For a given sample transition $(S, A, R, S', A')$, Sarsa [Rummery and Niranjan, 1994] is an on-policy algorithm that updates the $Q$ value as follows:
$$Q_{k+1}(S, A) = Q_k(S, A) + \alpha_k \delta^S_k, \tag{3}$$
$$\delta^S_k = R + \gamma Q_k(S', A') - Q_k(S, A), \tag{4}$$
where $\delta^S_k$ is the $k$-th TD error and $\alpha_k$ is the step-size.

Expected Sarsa: Expected Sarsa [Van Seijen et al., 2009] uses the expectation over all next state-action pairs under the target policy to estimate the $Q$ value:
$$Q_{k+1}(S, A) = Q_k(S, A) + \alpha_k \delta^{ES}_k,$$
$$\delta^{ES}_k = R + \gamma\,\mathbb{E}_\pi Q_k(S', \cdot) - Q_k(S, A) = R + \gamma\sum_{a\in A}\pi(S', a) Q_k(S', a) - Q_k(S, A), \tag{5}$$
where $\delta^{ES}_k$ is the $k$-th expected TD error. Expected Sarsa is an off-policy algorithm if $\mu \ne \pi$; for example, when $\pi$ is greedy with respect to $Q$, Expected Sarsa reduces to Q-Learning [Watkins, 1989]. If the trajectory is generated by $\pi$, Expected Sarsa is an on-policy algorithm [Van Seijen et al., 2009]. Both algorithms are guaranteed to converge under mild conditions [Singh et al., 2000; Van Seijen et al., 2009].

Q(σ): One-step Q(σ) [Sutton and Barto, 2017; De Asis et al., 2018] is a weighted average of the Sarsa update and the Expected Sarsa update through a sampling parameter $\sigma$:
$$Q_{k+1}(S, A) = Q_k(S, A) + \alpha_k \delta^\sigma_k, \qquad \delta^\sigma_k = \sigma\,\delta^S_k + (1 - \sigma)\,\delta^{ES}_k, \tag{6}$$
where $\sigma \in [0, 1]$ is the degree of sampling: $\sigma = 1$ denotes full sampling, and $\sigma = 0$ denotes a pure expectation with no sampling. $\delta^S_k$ and $\delta^{ES}_k$ are defined in (4) and (5).
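To make the one-step Q(σ) update concrete, the following minimal sketch implements equations (3)-(6) for a tabular value function. The array layout, argument names, and the way the policy is represented (a |S| × |A| probability matrix) are our own illustrative assumptions, not prescribed by the paper.

```python
import numpy as np

def q_sigma_step(Q, pi, s, a, r, s_next, a_next, sigma, gamma, alpha):
    """One-step Q(sigma) update: a weighted average of the Sarsa TD error
    (full sampling) and the Expected-Sarsa TD error (pure expectation)."""
    # Sarsa TD error, Eq. (4): bootstrap from the sampled next action.
    delta_sarsa = r + gamma * Q[s_next, a_next] - Q[s, a]
    # Expected-Sarsa TD error, Eq. (5): bootstrap from the expectation under pi.
    delta_exp = r + gamma * np.dot(pi[s_next], Q[s_next]) - Q[s, a]
    # Mixed TD error, Eq. (6).
    delta = sigma * delta_sarsa + (1.0 - sigma) * delta_exp
    Q[s, a] += alpha * delta
    return Q
```

Setting `sigma=1` recovers the Sarsa update and `sigma=0` recovers Expected Sarsa (and hence Q-Learning when the target policy is greedy).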
2.2 λ-Return Algorithm

One-step TD learning can be generalized to multi-step bootstrapping methods. The λ-return algorithm [Watkins, 1989] is a particular way to mix many multi-step TD learning algorithms by weighting the n-step returns proportionally to $(1-\lambda)\lambda^n$. The λ-operator $T^\lambda_\pi$ is a flexible way to express the λ-return algorithm. For a given trajectory $\{S_t, A_t, R_t\}_{t\ge 0}$ generated by policy $\pi$,
$$(T^\lambda_\pi q)(s,a) = (1-\lambda)\sum_{n\ge 0}\lambda^n \big[(T_\pi)^{n+1} q\big](s,a)
= (1-\lambda)\sum_{n\ge 0}\lambda^n\,\mathbb{E}_\pi\big[G^{(n)} \mid S_0 = s, A_0 = a\big]
= q(s,a) + \mathbb{E}_\pi\Big[\sum_{n\ge 0}(\gamma\lambda)^n \delta_n\Big],$$
where $G^{(n)} = \sum_{t=0}^{n}\gamma^t R_t + \gamma^{n+1} Q(S_{n+1}, A_{n+1})$ is the n-step return from the initial state-action pair $(S_0, A_0)$, the term $(1-\lambda)\sum_{n\ge 0}\lambda^n G^{(n)}$ is called the λ-return, and $\delta_n = R_n + \gamma Q(S_{n+1}, A_{n+1}) - Q(S_n, A_n)$. (The notation here is consistent with [Bertsekas et al., 2012].)
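The identity above, which rewrites the weighted mixture of n-step returns as the current estimate plus a $(\gamma\lambda)^t$-discounted sum of TD errors, can be checked numerically on a finite trajectory. The sketch below is an illustrative verification under the assumption of a terminated episode (bootstrap value zero after the last step); the function and variable names are ours, not the paper's.

```python
import numpy as np

def lambda_return_two_ways(rewards, q_along_traj, gamma, lam):
    """Compute the forward-view lambda-return of the first state-action pair
    of a terminated episode in two equivalent ways:
      (a) the (1-lam)*lam^n weighted mixture of n-step returns, and
      (b) Q(S_0, A_0) plus the (gamma*lam)^t-discounted sum of TD errors.
    `rewards[t]` is R_t and `q_along_traj[t]` is Q(S_t, A_t); the value after
    termination is taken to be zero (an assumption of this illustration)."""
    T = len(rewards)
    q = list(q_along_traj) + [0.0]          # Q at the terminal "state" is 0

    # (a) Weighted n-step returns; for n >= T-1 every n-step return equals
    # the full Monte Carlo return, so the tail weight collapses to lam^(T-1).
    g_lambda = 0.0
    for n in range(T):
        g_n = sum(gamma ** t * rewards[t] for t in range(min(n + 1, T)))
        if n + 1 < T:
            g_n += gamma ** (n + 1) * q[n + 1]
        weight = (1 - lam) * lam ** n if n < T - 1 else lam ** (T - 1)
        g_lambda += weight * g_n

    # (b) Current estimate plus discounted sum of TD errors.
    deltas = [rewards[t] + gamma * q[t + 1] - q[t] for t in range(T)]
    g_td = q[0] + sum((gamma * lam) ** t * deltas[t] for t in range(T))
    return g_lambda, g_td
```

For any fixed $Q$, $\gamma$ and $\lambda$, the two returned quantities agree up to floating-point error, which is exactly the content of the displayed identity.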

Since $q^\pi$ is the fixed point of $T_\pi$, $q^\pi$ remains the fixed point of $T^\lambda_\pi$. When $\lambda = 0$, $T^\lambda_\pi$ equals the usual Bellman operator $T_\pi$; when $\lambda = 1$, the evaluation iteration $Q_{k+1} = T^{\lambda=1}_\pi Q_k$ becomes the Monte Carlo method. It is well known that λ trades off the bias of bootstrapping with an approximate $q$ against the variance of sampling multi-step return estimates [Kearns and Singh, 2000]; in practice, a high, intermediate value of λ is typically better [Singh and Dayan, 1998; Sutton, 1996].

3 Mixed-sampling Operator

In this section, we present the mixed-sampling operator $T^\sigma_{\pi,\mu}$, which is one of our key contributions and provides a flexible way to analyze our new algorithm later. By introducing a sampling parameter $\sigma \in [0, 1]$, the mixed-sampling operator varies continuously from the pure-expectation method to the full-sampling method. We first analyze the contraction property of $T^\sigma_{\pi,\mu}$, then introduce the λ-return version of the mixed-sampling operator, and finally give an upper error bound for the corresponding policy evaluation algorithm.

3.1 Contraction of Mixed-sampling Operator

Definition 1 (Mixed-sampling operator). The mixed-sampling operator $T^\sigma_{\pi,\mu}$ is a map on $\mathbb{R}^{S\times A}$: for all $s \in S$, $a \in A$, $\sigma \in [0, 1]$,
$$T^\sigma_{\pi,\mu}: \mathbb{R}^{S\times A} \to \mathbb{R}^{S\times A}, \qquad q(s,a) \mapsto q(s,a) + \mathbb{E}_\mu\Big[\sum_{t\ge 0}\gamma^t \delta^\sigma_t\Big], \tag{7}$$
where
$$\delta^\sigma_t = \sigma\big(R_t + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\big) + (1-\sigma)\big(R_t + \gamma\,\mathbb{E}_\pi Q(S_{t+1}, \cdot) - Q(S_t, A_t)\big),$$
and $\mathbb{E}_\pi Q(S_{t+1}, \cdot) = \sum_{a\in A}\pi(S_{t+1}, a)\,Q(S_{t+1}, a)$.

The parameter σ is the degree of sampling introduced by the Q(σ) algorithm [De Asis et al., 2018]. At one extreme, σ = 0 (pure expectation), $T^{\sigma=0}_{\pi,\mu}$ yields the n-step returns $G^{(n)}_t = \sum_{k=t}^{t+n}\gamma^{k-t}\delta^{ES}_k + \gamma^{n+1}\,\mathbb{E}_\pi Q(S_{t+n+1}, \cdot)$ used in $Q^\pi(\lambda)$ [Harutyunyan et al., 2016], where $\delta^{ES}_k$ is the $k$-th expected TD error. Multi-step Sarsa [Sutton and Barto, 2017] corresponds to the other extreme, σ = 1 (full sampling). Every intermediate σ creates a mixed method that varies continuously from pure expectation to full sampling, which is why we call $T^\sigma_{\pi,\mu}$ the mixed-sampling operator.

λ-Return Version

We now define the λ-version of $T^\sigma_{\pi,\mu}$ and denote it $T^{\sigma,\lambda}_{\pi,\mu}$:
$$T^{\sigma,\lambda}_{\pi,\mu}\,q(s,a) = q(s,a) + \mathbb{E}_\mu\Big[\sum_{t\ge 0}(\gamma\lambda)^t \delta^\sigma_t\Big], \tag{8}$$
where, as usual, λ takes the operator from the TD(0) version to the Monte Carlo version. When σ = 0, $T^{\sigma=0,\lambda}_{\pi,\mu}$ reduces to the $Q^\pi(\lambda)$ operator [Harutyunyan et al., 2016]; when σ = 1, $T^{\sigma=1,\lambda}_{\pi,\mu}$ reduces to the λ-operator $T^\lambda_\mu$.
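To make the λ-version of the mixed-sampling operator concrete, the sketch below evaluates its forward-view target, equation (8), for the first state-action pair of a terminated episode: the current estimate plus the $(\gamma\lambda)^t$-discounted sum of mixed TD errors $\delta^\sigma_t$. It is a minimal sketch under our own assumptions: a tabular $Q$, a terminal bootstrap value of zero, and our own function and variable names.

```python
import numpy as np

def mixed_sampling_target(states, actions, rewards, Q, pi, sigma, lam, gamma):
    """Forward-view target of the lambda-version mixed-sampling operator,
    Eq. (8), for the pair (states[0], actions[0]) of a terminated episode:
        q(s, a) + sum_t (gamma*lam)^t * delta_sigma_t,
    where delta_sigma_t mixes the Sarsa and Expected-Sarsa TD errors with
    weight sigma.  Q is a |S| x |A| array, pi is a |S| x |A| policy matrix."""
    T = len(rewards)
    target = Q[states[0], actions[0]]
    for t in range(T):
        q_sa = Q[states[t], actions[t]]
        if t + 1 < len(states):           # non-terminal successor
            q_next_sampled = Q[states[t + 1], actions[t + 1]]
            q_next_expected = np.dot(pi[states[t + 1]], Q[states[t + 1]])
        else:                             # episode terminated after this step
            q_next_sampled = q_next_expected = 0.0
        delta_sigma = rewards[t] + gamma * (
            sigma * q_next_sampled + (1.0 - sigma) * q_next_expected) - q_sa
        target += (gamma * lam) ** t * delta_sigma
    return target
```

Setting σ = 1 or σ = 0 recovers the forward-view Sarsa(λ) and $Q^\pi(\lambda)$ targets respectively, and λ = 0 recovers the one-step Q(σ) target.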
The next theorem provides a basic property of $T^{\sigma,\lambda}_{\pi,\mu}$.

Theorem 1. The operator $T^{\sigma,\lambda}_{\pi,\mu}$ is a γ-contraction: for any $q_1, q_2$,
$$\|T^{\sigma,\lambda}_{\pi,\mu} q_1 - T^{\sigma,\lambda}_{\pi,\mu} q_2\|_\infty \le \gamma\,\|q_1 - q_2\|_\infty.$$
Furthermore, for any initial $Q_0$, the sequence $\{Q_k\}_{k\ge 0}$ generated by the iteration $Q_{k+1} = T^{\sigma,\lambda}_{\pi,\mu} Q_k$ converges to the unique fixed point of $T^{\sigma,\lambda}_{\pi,\mu}$.

Proof. Unfolding the operator $T^{\sigma,\lambda}_{\pi,\mu}$, we have
$$T^{\sigma,\lambda}_{\pi,\mu} q = \sigma\big[q + B(T_\mu q - q)\big] + (1-\sigma)\big[q + B(T_\pi q - q)\big], \tag{9}$$
where $B = (I - \gamma\lambda P^\mu)^{-1}$. Since both $q \mapsto q + B(T_\mu q - q)$ [Bertsekas et al., 2012] and $q \mapsto q + B(T_\pi q - q)$ [Harutyunyan et al., 2016; Munos et al., 2016] are γ-contraction operators, and $T^{\sigma,\lambda}_{\pi,\mu}$ is a convex combination of these two operators, $T^{\sigma,\lambda}_{\pi,\mu}$ is a γ-contraction. ∎

3.2 Upper Error Bound of Policy Evaluation

In this section we discuss the prediction ability of the policy evaluation iteration $Q_{k+1} = T^{\sigma,\lambda}_{\pi,\mu} Q_k$ from Theorem 1. Our results show that when μ and π are sufficiently close, the prediction ability of the policy evaluation iteration increases gradually as σ decreases from 1 to 0.

Lemma 1. If a positive sequence $\{a_k\}_{k\ge 1}$ satisfies $a_{k+1} \le \alpha a_k + \beta$ with $\alpha < 1$, then
$$a_k \le \frac{\beta}{1-\alpha} + \alpha^{k}\Big(a_1 - \frac{\beta}{1-\alpha}\Big).$$
Furthermore, for any $\epsilon > 0$ there exists $K$ such that, for all $k > K$, $a_k \le \frac{\beta}{1-\alpha} + \epsilon$.

Theorem 2 (Upper error bound of policy evaluation). Consider the policy evaluation algorithm $Q_{k+1} = T^{\sigma,\lambda}_{\pi,\mu} Q_k$. Suppose the behavior policy μ is ε-away from the target policy π, in the sense that $\max_{s\in S}\sum_{a\in A}|\pi(s,a) - \mu(s,a)| \le \epsilon$, with $\epsilon < \frac{1-\gamma}{\gamma}$ and $\gamma(1+2\sigma) < 1$. Then there exists $K$ such that, for all $k > K$, the policy evaluation sequence $\{Q_k\}_{k\ge 0}$ satisfies
$$\|Q_{k+1} - q^\pi\|_\infty \le \frac{\sigma(M + \gamma C)\,\epsilon}{1-\gamma},$$
where, for the given policy π, the constants $M$ and $C$ are determined by the learning system (they are made explicit in the proof).

Proof. Rewrite the policy evaluation iteration as
$$Q_{k+1} = \sigma\big[Q_k + B(T_\mu Q_k - Q_k)\big] + (1-\sigma)\,T^\lambda_{\pi,\mu} Q_k,$$
where $T^\lambda_{\pi,\mu} q = q + B(T_\pi q - q)$ as in (9). Since $q^\pi$ is the fixed point of $T^\lambda_{\pi,\mu}$ [Harutyunyan et al., 2016], it suffices to estimate the first term. Because μ is ε-away from π, for every state $s$,
$$|R^\pi_s - R^\mu_s| \le \max_a|\pi(s,a) - \mu(s,a)|\cdot\sum_{a\in A}|R^a_s| \le \epsilon\,|A|\max_a|R^a_s| = \epsilon M_s,$$
so that $\|R^\pi - R^\mu\|_\infty \le \epsilon M$, where $M_s = |A|\max_a|R^a_s|$ is determined by the reinforcement learning system, is independent of π and μ, and $M = \max_{s\in S} M_s$. Similarly, $\|\gamma(P^\pi - P^\mu)q^\pi\|_\infty \le \gamma\epsilon C$, where $C = \|q^\pi\|_\infty$ is a constant determined by the learning system for the given policy π. Expanding $Q_k + B(T_\mu Q_k - Q_k) - q^\pi$ with these two estimates yields a recursion of the form required by Lemma 1, whose contraction coefficient is smaller than one under the assumptions $\epsilon < \frac{1-\gamma}{\gamma}$ and $\gamma(1+2\sigma) < 1$; applying Lemma 1 gives the stated bound. ∎

Remark 1. The proof of Theorem 2 relies on the assumption that ε is small but nonzero, where ε bounds the discrepancy between the behavior policy μ and the target policy π. That is, the prediction ability of the policy evaluation iteration depends on the gap between μ and π. A deeper discussion of this gap relates to the λ-ε trade-off of [Harutyunyan et al., 2016], which is beyond the scope of this paper.

4 Q(σ, λ) Control Algorithm

In this section, we present the Q(σ, λ) algorithm for control. (The work in [Dumke, 2017] proposes a similar method, also called Q(σ, λ); our Q(σ, λ) differs from theirs in the version of the eligibility-trace update.) We analyze the off-line version of Q(σ, λ), which converges to the optimal value function exponentially fast. Consider the typical iteration $(Q_k, \pi_k)$, where $\{\mu_k\}_{k\ge 0}$ is a sequence of behavior policies and $\pi_{k+1}$ is obtained by the following two steps:

Step 1 (policy evaluation): $Q_{k+1} = T^{\sigma,\lambda}_{\pi_k,\mu_k} Q_k$;
Step 2 (policy improvement): $T_{\pi_{k+1}} Q_{k+1} = T\,Q_{k+1}$, i.e., $\pi_{k+1}$ is the greedy policy with respect to $Q_{k+1}$.

We call the approach defined by Step 1 and Step 2 the Q(σ, λ) control algorithm. Theorem 3 gives its convergence rate.

Theorem 3 (Convergence of the Q(σ, λ) control algorithm). Consider the sequence $\{Q_k, \pi_k\}_{k\ge 0}$ generated by the Q(σ, λ) control algorithm. For given $\sigma, \lambda, \gamma \in (0, 1)$,
$$\|Q_{k+1} - q^*\|_\infty \le \eta\,\|Q_k - q^*\|_\infty, \qquad \eta = \gamma + 2\sigma\gamma.$$
In particular, if $\sigma < \frac{1-\gamma}{2\gamma}$, then $0 < \eta < 1$ and the sequence $\{Q_k\}_{k\ge 1}$ converges to $q^*$ exponentially fast: $\|Q_{k+1} - q^*\|_\infty = O(\eta^{k+1})$.

Proof. By the decomposition $T^{\sigma,\lambda}_{\pi,\mu} q = \sigma T^\lambda_\mu q + (1-\sigma) T^\lambda_{\pi,\mu} q$ (cf. (9)) we have
$$\|Q_{k+1} - q^*\|_\infty \le \sigma\,\|T^\lambda_{\mu_k} Q_k - q^*\|_\infty + (1-\sigma)\,\|T^\lambda_{\pi_k,\mu_k} Q_k - q^*\|_\infty \le \eta\,\|Q_k - q^*\|_\infty,$$
where the second inequality is based on Theorem 2 of [Munos et al., 2016] and a corresponding proposition of [Bertsekas et al., 2012]. ∎

Algorithm 1: On-line Q(σ, λ) algorithm
  Require: initialize $Q_0(s,a)$ arbitrarily, $\forall s\in S, a\in A$
  Require: initialize $\mu_k$ to be the behavior policy
  Parameters: step-size $\alpha_t \in (0, 1]$
  Repeat (for each episode $k$):
    $Z(s,a) \leftarrow 0$, $\forall s\in S, a\in A$
    $Q_{k+1}(s,a) \leftarrow Q_k(s,a)$, $\forall s\in S, a\in A$
    Initialize the state-action pair $(S_0, A_0)$
    For $t = 0, 1, 2, \dots, T_k$:
      Observe a sample $R_t, S_{t+1}$; choose $A_{t+1} \sim \mu_k$
      $\delta^\sigma_t \leftarrow R_t + \gamma\big[(1-\sigma)\,\mathbb{E}_{\pi_k} Q_{k+1}(S_{t+1},\cdot) + \sigma\,Q_{k+1}(S_{t+1}, A_{t+1})\big] - Q_{k+1}(S_t, A_t)$
      For all $s\in S, a\in A$:
        $Z(s,a) \leftarrow \gamma\lambda\,Z(s,a) + \mathbb{I}\{(S_t, A_t) = (s,a)\}$
        $Q_{k+1}(s,a) \leftarrow Q_{k+1}(s,a) + \alpha_k\,\delta^\sigma_t\,Z(s,a)$
      End For
      $S_t \leftarrow S_{t+1}$; $A_t \leftarrow A_{t+1}$
      If $S_{t+1}$ is terminal: break
    End For
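A runnable tabular version of Algorithm 1 might look as follows. This is a sketch under our own assumptions: a Gym-style environment interface (`env.reset()` returning a discrete state, `env.step(a)` returning `(s_next, r, done)`), an ε-greedy behavior policy, and a target policy that is greedy with respect to the current Q; these choices are illustrative and not prescribed by the paper.

```python
import numpy as np

def online_q_sigma_lambda(env, n_states, n_actions, sigma=0.5, lam=0.8,
                          gamma=0.99, alpha=0.3, epsilon=0.1, episodes=200,
                          seed=0):
    """On-line tabular Q(sigma, lambda) with accumulating eligibility traces
    (a sketch of Algorithm 1; the environment interface and exploration
    scheme are illustrative assumptions)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))

    def behavior(s):                      # epsilon-greedy behavior policy mu
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    def target_probs(s):                  # greedy target policy pi
        p = np.zeros(n_actions)
        p[np.argmax(Q[s])] = 1.0
        return p

    for _ in range(episodes):
        z = np.zeros_like(Q)              # eligibility traces Z(s, a)
        s = env.reset()
        a = behavior(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = behavior(s_next)
            # Mixed TD error delta_t^sigma: sigma-weighted Sarsa /
            # Expected-Sarsa bootstrap, zero bootstrap at termination.
            if done:
                bootstrap = 0.0
            else:
                bootstrap = (sigma * Q[s_next, a_next]
                             + (1 - sigma) * np.dot(target_probs(s_next),
                                                    Q[s_next]))
            delta = r + gamma * bootstrap - Q[s, a]
            # Accumulating trace, then a simultaneous update of all pairs.
            z *= gamma * lam
            z[s, a] += 1.0
            Q += alpha * delta * z
            s, a = s_next, a_next
    return Q
```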

5 On-line Implementation of Q(σ, λ)

We have discussed the contraction property of the mixed-sampling operator $T^{\sigma,\lambda}_{\pi,\mu}$ and, based on it, introduced the Q(σ, λ) control algorithm. The iterations in Theorem 2 and Theorem 3 are off-line. In this section, we give the on-line version of Q(σ, λ) and discuss its convergence.

5.1 On-line Learning

Off-line learning is expensive because the learning process can only be carried out at the end of an episode, whereas on-line learning updates the value function with lower computational cost and better performance. A simple interpretation of the equivalence between off-line and on-line learning is that, by the end of an episode, the total update of the forward view (off-line learning) equals the total update of the backward view (on-line learning) [Sutton and Barto, 1998]. (True on-line learning was first introduced by [Seijen and Sutton, 2014]; see [Van Seijen et al., 2016] for more details.) By this equivalence, on-line learning can be seen as an inexpensive implementation of the off-line algorithm. Another interpretation of on-line learning was provided by [Singh and Sutton, 1996]: TD learning with accumulating traces approximates the every-visit Monte Carlo method, while TD learning with replacing traces approximates the first-visit Monte Carlo method.

The iterations in Theorem 2 and Theorem 3 are stated in terms of expectations. In practice, we only have access to the trajectory $\{S_t, A_t, R_t\}_{t\ge 0}$, and we use statistical estimates built from this trajectory to approximate the value function. Algorithm 1 is the on-line form of Q(σ, λ).

5.2 On-line Learning Convergence Analysis

We make some common assumptions, similar to [Bertsekas and Tsitsiklis, 1996; Harutyunyan et al., 2016].

Assumption 1 (minimum visit frequency). $\sum_{t\ge 0} P(S_t = s, A_t = a) > 0$ for every pair $(s, a)$, which implies complete exploration.

Assumption 2. For every historical chain $\mathcal{F}_t$ of the MDP, $P\big(N_t(s,a) \ge k \mid \mathcal{F}_t\big) \le \gamma\rho^{k}$, where ρ is a positive constant, $k$ is a positive integer, and $N_t(s,a)$ denotes the number of times the pair $(s,a)$ is visited in the $t$-th trajectory $\mathcal{F}_t$.

For convenience, we first fix some notation. Let $Q^o_{k,t}$ be the vector obtained after $t$ on-line iterations within the $k$-th trajectory (the superscript $o$ emphasizes on-line learning), and let the $k$-th trajectory $\{S_t, A_t, R_t\}_{t\ge 0}$ be sampled by the policy $\mu_k$. The on-line update rules are, for all $(s,a) \in S\times A$:
$$Q^o_{k,0}(s,a) = Q^o_k(s,a),$$
$$\delta^o_{k,t} = R_t + \gamma\big[(1-\sigma)\,\mathbb{E}_{\pi_k} Q^o_{k,t}(S_{t+1},\cdot) + \sigma\,Q^o_{k,t}(S_{t+1}, A_{t+1})\big] - Q^o_{k,t}(S_t, A_t),$$
$$Q^o_{k,t+1}(s,a) = Q^o_{k,t}(s,a) + \alpha_k(s,a)\,z^o_{k,t}(s,a)\,\delta^o_{k,t},$$
$$Q^o_{k+1}(s,a) = Q^o_{k,T_k}(s,a),$$
where $T_k$ is the length of the $k$-th trajectory.

Theorem 4. Under Assumption 1 and Assumption 2, if the step-sizes satisfy $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$, and $\pi_k$ is greedy with respect to $Q^o_k$, then $Q^o_k \to q^*$ w.p. 1 as $k \to \infty$ (w.p. 1 is short for "with probability one").

Proof. After some simple algebra,
$$Q^o_{k+1}(s,a) = Q^o_k(s,a) + \tilde\alpha_k\Big(G^o_k(s,a) - \frac{N_k(s,a)}{\mathbb{E}[N_k(s,a)]}\,Q^o_k(s,a)\Big), \qquad G^o_k(s,a) = \frac{1}{\mathbb{E}[N_k(s,a)]}\sum_{t=0}^{T_k} z^o_{k,t}(s,a)\,\delta^o_{k,t},$$
where $\tilde\alpha_k = \mathbb{E}[N_k(s,a)]\,\alpha_k(s,a)$. We rewrite the off-line update analogously:
$$Q^f_{k+1}(s,a) = Q^f_k(s,a) + \tilde\alpha_k\Big(G^f_k(s,a) - \frac{N_k(s,a)}{\mathbb{E}[N_k(s,a)]}\,Q^f_k(s,a)\Big), \qquad G^f_k(s,a) = \frac{1}{\mathbb{E}[N_k(s,a)]}\sum_{t\in N_k(s,a)} Q^\lambda_{k,t}(s,a),$$
where $Q^\lambda_{k,t}(s,a)$ is the λ-return at time $t$ when the pair $(s,a)$ is visited in the $k$-th trajectory, the superscript $f$ emphasizes the forward (off-line) update, and $N_k(s,a)$ denotes the number of times the pair $(s,a)$ is visited in the $k$-th trajectory. We define the residual between the on-line estimate $G^o_k(s,a)$ and the off-line estimate $G^f_k(s,a)$ in the $k$-th trajectory:
$$\mathrm{Res}_k(s,a) = G^o_k(s,a) - G^f_k(s,a).$$
Set $\Delta_k(s,a) = Q^o_k(s,a) - q^*(s,a)$, the error between the value function $Q^o_k$ learned after $k$ episodes and the optimal value function $q^*$, and consider the random iterative process
$$\Delta_{k+1}(s,a) = \big(1 - \hat\alpha_k(s,a)\big)\,\Delta_k(s,a) + \hat\beta_k(s,a)\,F_k(s,a), \tag{11}$$
where
$$\hat\alpha_k(s,a) = N_k(s,a)\,\alpha_k(s,a), \qquad \hat\beta_k(s,a) = \mathbb{E}_{\mu_k}[N_k(s,a)]\,\alpha_k(s,a), \qquad F_k(s,a) = G^o_k(s,a) - \frac{N_k(s,a)}{\mathbb{E}_{\mu_k}[N_k(s,a)]}\,q^*(s,a).$$

Step 1 (upper bound on $\mathrm{Res}_k(s,a)$):
$$\mathbb{E}_{\mu_k}\big[\mathrm{Res}_k(s,a)\big] \le C_k\max_{s,a}\big|Q^o_{k+1}(s,a) - q^*(s,a)\big|, \tag{12}$$
where $C_k \to 0$. Define
$$\mathrm{Res}_{k,t}(s,a) = \frac{1}{\mathbb{E}[N_k(s,a)]}\Big(\sum_{m=0}^{t} z^o_{k,m}(s,a)\,\delta^o_{k,m} - \sum_{m=0}^{t} Q^\lambda_{k,m}(s,a)\Big).$$

$\mathrm{Res}_{k,t}(s,a)$ is the difference between the total on-line updates of the first $t$ steps and the first $t$ off-line updates in the $k$-th trajectory. By induction on $t$, we have
$$\|\mathrm{Res}_{k,t+1}(s,a)\| \le \alpha_m\big(C\,\Delta_{k,t} + \|\mathrm{Res}_{k,t}(s,a)\|\big), \qquad 0 \le t \le T_k,$$
where $C$ is a constant, $\Delta_{k,t} = \|Q^o_{k,t}(s,a) - q^*(s,a)\|$, and $\alpha_m = \max_{0\le t\le T_k}\alpha_t(s,a)$. By the step-size condition in Theorem 4, $\alpha_m \to 0$, and (12) follows.

Step 2: $\max_{s,a}\big|\mathbb{E}_{\mu_k}[F_k(s,a)]\big| \le \gamma\max_{s,a}\|\Delta_k(s,a)\|$. Indeed, by the definitions of $F_k$ and $G^o_k = G^f_k + \mathrm{Res}_k$,
$$\mathbb{E}_{\mu_k}[F_k(s,a)] = \mathbb{E}_{\mu_k}\Big[G^f_k(s,a) + \mathrm{Res}_k(s,a) - \frac{N_k(s,a)}{\mathbb{E}_{\mu_k}[N_k(s,a)]}\,q^*(s,a)\Big] = \frac{\sum_{t} P\big(N_k(s,a) \ge t\big)\,\mathbb{E}_{\mu_k}\big[Q^\lambda_{k,t}(s,a) - q^*(s,a)\big]}{\mathbb{E}_{\mu_k}[N_k(s,a)]} + \mathbb{E}_{\mu_k}\big[\mathrm{Res}_k(s,a)\big].$$
From the property of the eligibility trace (for more details refer to [Bertsekas et al., 2012]) and Assumption 2, we have
$$\mathbb{E}_{\mu_k}\big[Q^\lambda_{k,t}(s,a) - q^*(s,a)\big] \le \gamma\rho^{t}\max_{s,a}\big|Q^f_k(s,a) - q^*(s,a)\big|.$$
Combining this with Step 1, for some $t > 0$,
$$\big|\mathbb{E}_{\mu_k}[F_k(s,a)]\big| \le \big(\gamma\rho^{t} + \alpha_t\big)\max_{s,a}\|\Delta_k(s,a)\| \le \gamma\max_{s,a}\|\Delta_k(s,a)\|.$$

Step 3: $Q^o_k \to q^*$ w.p. 1 as $k \to \infty$. Considering the iteration (11) and Theorem 1 of [Jaakkola et al., 1994], we conclude that $Q^o_k \to q^*$ w.p. 1. ∎

Figure 1: Root-mean-square (RMS) error of the state values as a function of episodes in the 19-state random walk, for both accumulating and replacing traces. The plot shows the prediction ability of Q(σ, λ) for varying σ, with λ fixed to 0.8 and γ = 1.

Figure 2: Average return per episode on mountain car. A right-centered moving average with a window of 20 successive episodes was used to smooth the results. The right plot compares Q(σ, λ) with Q(σ); the left plot compares Q(σ, λ) with Q(λ) (σ = 0) and Sarsa(λ) (σ = 1). γ = 0.99, step-size α = 0.3.

6 Experiments

6.1 Experiment for Prediction Capability

We test the prediction ability of Q(σ, λ) in the 19-state random walk environment [De Asis et al., 2018]. At each state the agent has two actions, left and right, and takes each action with equal probability. We compare the root-mean-square (RMS) error as a function of episodes, with σ varying from 0 to 1 in steps of 0.2. In this experiment, the behavior policy is $\mu = (0.5 - D, 0.5 + D)$, $D \sim U(-0.01, 0.01)$, where $U$ is the uniform distribution. The results in Figure 1 show that the prediction performance of Q(σ, λ) increases gradually as σ decreases from 1 to 0, which verifies the upper error bound in Theorem 2 (a sketch of this setup is given below, after Table 1).

6.2 Experiment for Control Capability

We test the control capability of Q(σ, λ) on the classical episodic task mountain car [Sutton and Barto, 1998]. Because the state space of this environment is continuous, we use tile-coding function approximation [Sutton, 1996], with version 3 of Sutton's tile-coding software and 8 tilings. For the right part of Figure 2, we collect data by varying σ from 0 to 1; the results show that Q(σ, λ) converges significantly faster than Q(σ). The left part of Figure 2 shows that Q(σ, λ) with an intermediate value of σ can outperform Q(λ) and Sarsa(λ). Table 1 and Table 2 summarize the average return after 50 and 200 episodes. To gain more insight into the nature of the results, we run σ from 0 to 1 in steps of 0.02 and adopt the statistical method of [De Asis et al., 2018]: lower (LB) and upper (UB) 95% confidence interval bounds are provided to validate the results.
"Dynamic σ" in the tables means that σ need not remain constant throughout an episode [De Asis et al., 2018]; in this experiment, σ is a function of time drawn from a normal distribution centered at 0.5. The average return after only 50 episodes can be interpreted as a measure of initial performance, whereas the average return after 200 episodes shows how well an algorithm is capable of learning [De Asis et al., 2018]. The results show that Q(σ, λ) with an intermediate value of σ had the best final performance.

Table 1: Average return per episode after 50 episodes (mean with 95% confidence bounds UB/LB) for Q(0, λ) (i.e., Q(λ)), Q(1, λ) (i.e., Sarsa(λ)), Q(0.5, λ), and Dynamic σ.
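A minimal sketch of the prediction experiment in Section 6.1 is shown below: a 19-state random walk with equiprobable left/right actions, an on-line Q(σ, λ) prediction update with accumulating traces, and the RMS error of the state values against the known true values. The reward scheme (-1 on the left terminal, +1 on the right, 0 elsewhere), the way the behavior-policy perturbation is sampled, and all names are our own assumptions for illustration.

```python
import numpy as np

def random_walk_rms(sigma=0.5, lam=0.8, gamma=1.0, alpha=0.1,
                    episodes=100, n=19, seed=0):
    """19-state random walk: evaluate the equiprobable target policy with
    on-line Q(sigma, lambda) (accumulating traces) and record the RMS error
    of the state values after each episode.  The -1/+1 terminal rewards and
    the slightly perturbed behavior policy are assumptions of this sketch."""
    rng = np.random.default_rng(seed)
    true_v = np.array([i / (n + 1) * 2 - 1 for i in range(1, n + 1)])
    Q = np.zeros((n + 2, 2))                 # states 0 and n+1 are terminal
    pi = np.full((n + 2, 2), 0.5)            # target policy: equiprobable

    def behave():                            # behavior mu = (0.5-D, 0.5+D)
        d = rng.uniform(-0.01, 0.01)
        return 0 if rng.random() < 0.5 - d else 1

    errors = []
    for _ in range(episodes):
        z = np.zeros_like(Q)
        s, a = (n + 1) // 2, behave()        # start in the middle state
        while True:
            s_next = s - 1 if a == 0 else s + 1
            done = s_next == 0 or s_next == n + 1
            r = -1.0 if s_next == 0 else (1.0 if s_next == n + 1 else 0.0)
            a_next = behave()
            bootstrap = 0.0 if done else (
                sigma * Q[s_next, a_next]
                + (1 - sigma) * pi[s_next] @ Q[s_next])
            delta = r + gamma * bootstrap - Q[s, a]
            z *= gamma * lam
            z[s, a] += 1.0
            Q += alpha * delta * z
            if done:
                break
            s, a = s_next, a_next
        v = (pi * Q).sum(axis=1)[1:-1]       # state values under pi
        errors.append(float(np.sqrt(np.mean((v - true_v) ** 2))))
    return errors
```

Plotting the returned error curves for several values of σ reproduces the qualitative behaviour reported in Figure 1: the error decreases more for smaller σ when the behavior policy is close to the target policy.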

Table 2: Average return per episode after 200 episodes (mean with 95% confidence bounds UB/LB) for Q(0, λ) (i.e., Q(λ)), Q(1, λ) (i.e., Sarsa(λ)), Q(0.5, λ), and Dynamic σ.

7 Conclusion

In this paper we presented a new method, called Q(σ, λ), which unifies Sarsa(λ) and Q^π(λ). We derived an upper error bound on the policy evaluation ability of Q(σ, λ). Furthermore, we proved the convergence of the Q(σ, λ) control algorithm to q^* under some conditions. The proposed approach was compared with one-step and multi-step TD learning methods, and the results demonstrated its effectiveness.

Acknowledgements

This work was partly supported by the National Key Research and Development Program of China (2017YFB2503), the Natural Science Foundation of China, and the Zhejiang Provincial Natural Science Foundation of China (LR15F).

References

[Bertsekas and Tsitsiklis, 1996] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.

[Bertsekas et al., 2012] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, volume 2. Athena Scientific, Belmont, MA, 2012.

[De Asis et al., 2018] Kristopher De Asis, J. Fernando Hernandez-Garcia, G. Zacharias Holland, and Richard S. Sutton. Multi-step reinforcement learning: A unifying algorithm. In AAAI, 2018.

[Dumke, 2017] Markus Dumke. Double Q(σ) and Q(σ, λ): Unifying reinforcement learning control algorithms. arXiv preprint arXiv:1711.01569, 2017.

[Harutyunyan et al., 2016] Anna Harutyunyan, Marc G. Bellemare, Tom Stepleton, and Rémi Munos. Q(λ) with off-policy corrections. In Algorithmic Learning Theory: 27th International Conference, ALT 2016, Bari, Italy, October 19-21, 2016, Proceedings, volume 9925, page 305. Springer, 2016.

[Jaakkola et al., 1994] Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. Convergence of stochastic iterative dynamic programming algorithms. In Advances in Neural Information Processing Systems, 1994.

[Kearns and Singh, 2000] Michael J. Kearns and Satinder P. Singh. Bias-variance error bounds for temporal difference updates. In COLT, 2000.

[Munos et al., 2016] Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, 2016.

[Rummery and Niranjan, 1994] Gavin A. Rummery and Mahesan Niranjan. On-line Q-learning using connectionist systems. Technical report, University of Cambridge, Department of Engineering, 1994.

[Seijen and Sutton, 2014] Harm Seijen and Rich Sutton. True online TD(λ). In International Conference on Machine Learning, 2014.

[Singh and Dayan, 1998] Satinder Singh and Peter Dayan. Analytical mean squared error curves for temporal difference learning. Machine Learning, 32(1):5-40, 1998.

[Singh and Sutton, 1996] Satinder P. Singh and Richard S. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 22(1-3), 1996.

[Singh et al., 2000] Satinder Singh, Tommi Jaakkola, Michael L. Littman, and Csaba Szepesvári. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3), 2000.

[Sutton and Barto, 1998] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.

[Sutton and Barto, 2017] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction (second edition, draft). MIT Press, Cambridge, 2017.

[Sutton, 1984] Richard S. Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, 1984.

[Sutton, 1988] Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9-44, 1988.

[Sutton, 1996] Richard S. Sutton.
Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems, 1996.

[Van Seijen et al., 2009] Harm van Seijen, Hado van Hasselt, Shimon Whiteson, and Marco Wiering. A theoretical and empirical analysis of Expected Sarsa. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL 2009). IEEE, 2009.

[Van Seijen et al., 2016] Harm van Seijen, A. Rupam Mahmood, Patrick M. Pilarski, Marlos C. Machado, and Richard S. Sutton. True online temporal-difference learning. Journal of Machine Learning Research, 17(145):1-40, 2016.

[Watkins, 1989] Christopher John Cornish Hellaby Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, 1989.
