Fuzzy Model-Based Reinforcement Learning


Martin Appl (1), Wilfried Brauer (2)

(1) Siemens AG, Corporate Technology, Information and Communications, D-8173 Munich, Germany, mail@martinappl.de
(2) Technical University of Munich, Department of Computer Science, D-829 Munich, Germany, brauer@informatik.tu-muenchen.de

ABSTRACT: Model-based reinforcement learning methods are known to be highly efficient with respect to the number of trials required for learning optimal policies. In this article, a novel fuzzy model-based reinforcement learning approach, fuzzy prioritized sweeping (F-PS), is presented. The approach is capable of learning strategies for Markov decision problems with continuous state and action spaces. The output of the algorithm is a Takagi-Sugeno fuzzy system with linear terms in the consequents of the rules. From the Q-function approximated by this fuzzy system an optimal control strategy can be easily derived. The proposed method is applied to the problem of selecting optimal framework signal plans in urban traffic networks. It is shown that the method outperforms existing model-based approaches.

KEYWORDS: reinforcement learning, model-based learning, fuzzy prioritized sweeping, Takagi-Sugeno fuzzy systems, framework signal plans

INTRODUCTION

Reinforcement learning means learning from experiences (Sutton and Barto, 1998; Bertsekas and Tsitsiklis, 1996). A reinforcement learning agent perceives certain characteristics of its environment, influences the environment by performing actions, and finally receives rewards according to the appropriateness of the selected actions. One can distinguish between indirect and direct reinforcement learning methods. Indirect methods, such as prioritized sweeping (Moore and Atkeson, 1993), build an internal model of the environment and calculate the optimal policy based on this model, whereas direct methods, such as Q-learning (Watkins, 1989), do not use an explicit model but learn directly from experiences. Indirect reinforcement learning methods are known to learn much faster than direct methods in many settings, since they can reuse information stored in their internal model.

Learning models of discrete environments is much easier than learning models of continuous ones. This may be the reason why most publications on model-based reinforcement learning deal with discrete Markov decision problems. Discrete methods can, of course, also be applied to continuous problems by discretizing the state and action spaces of these problems. The main challenge of this approach, however, is to define a partition of reasonable granularity, since fine partitions lead to a high number of states and thus to complex problems, whereas approximations based on coarse crisp partitions can be highly imprecise.

Model-based learning in continuous state spaces was previously discussed by Davies (1997), who suggested defining a coarse grid on the state space and approximating the continuous value function by interpolation based on this grid. This approximation approach is comparable to a Takagi-Sugeno fuzzy system with triangular membership functions and constant terms in the consequents of the rules. Davies, however, used a crisp partition for the training of the transition probabilities and the corresponding rewards, which seems to be inconsistent with the idea of interpolating. Besides, he did not consider continuous actions.

In this article, a fuzzy model-based reinforcement learning approach, fuzzy prioritized sweeping (F-PS), is considered.
The approach is capable of learning strategies for problems with continuous state and action spaces. The output of the F-PS approach is a Takagi-Sugeno fuzzy system with linear rules (Takagi and Sugeno, 1985). With such fuzzy systems, continuous value functions can be approximated much more precisely than with approximation architectures based on crisp partitions. Alternatively, the number of partitioning subsets can be reduced. The proposed method is applied to the problem of selecting framework signal plans in dependence of traffic conditions. Several approaches applying reinforcement learning to problems from traffic signal control can be found in the literature (e.g. Thorpe, 1997; Bingham, 1998; Appl and Palm, 1999). To the authors, however, no publication on the selection of framework signal plans by means of reinforcement learning methods is known.

In the following section, the basic Markov decision problem on which the further considerations are based is introduced. Afterwards, the fuzzy model-based reinforcement learning approach is presented. Finally, the effectiveness of the proposed algorithm is shown on the task of selecting framework signal plans.

BASIC MODEL

In the following it is assumed that the reinforcement learning agent gets inputs from a continuous state space X of dimension N_X and may perform actions taken from a continuous action space A of dimension N_A. The sets of dimensions of the state space and the action space will be denoted by D_X := {1, ..., N_X} and D_A := {1, ..., N_A}, respectively. Let, for each state x ∈ X and each action a ∈ A, p(y; x, a) be a probability density function giving the distribution of the successor state y if action a is executed in state x. Furthermore, let g(x, a, y) ∈ R be the unknown reward the agent gets for executing action a in state x if the action causes a transition to state y. The agent is supposed to select actions at discrete points in time. The goal of the learning task then is to find a stationary policy µ : X → A, i.e. a mapping from states to actions, such that the expected sum of discounted future rewards

    J^µ(x) := lim_{N→∞} E{ Σ_{κ=0}^{N} α^κ g(x_κ, µ(x_κ), x_{κ+1}) | x_0 = x },   α ∈ [0, 1),    (1)

is maximized for each x ∈ X, where x_{κ+1} is determined from x_κ using p(x_{κ+1}; x_κ, µ(x_κ)). Let

    Q^µ(x, a) := ∫ p(y; x, a) [ g(x, a, y) + α J^µ(y) ] dy    (2)

be the sum of discounted future rewards the agent may expect if it executes action a in state x and behaves according to the policy µ afterwards. Then, the optimal Q-values Q^{µ*}(x, a) are given by the fixed-point solution of the Bellman equation

    Q^{µ*}(x, a) = ∫ p(y; x, a) [ g(x, a, y) + α max_{b∈A} Q^{µ*}(y, b) ] dy,    (3)

and the optimal policy µ* is to execute in each state x the action a that maximizes these Q-values:

    µ*(x) := argmax_{a∈A} Q^{µ*}(x, a).    (4)

The F-PS approach described in the following approximates the continuous Q-function Q^{µ*} by a Takagi-Sugeno fuzzy system. Thereto, it is assumed that a fuzzy partition {µ^X_i}_{i∈I} of the state space is defined, where the subscripts of the N_{µX} membership functions are given by I = {1, ..., N_{µX}} and the labels and centers of the partitioning subsets are given by {X_i}_{i∈I} and {x̃_i}_{i∈I}, respectively. Likewise, it is assumed that the action space is partitioned by {µ^A_u}_{u∈U}, where U = {1, ..., N_{µA}} is the set of subscripts, {A_u}_{u∈U} gives the labels and {ã_u}_{u∈U} the centers of the N_{µA} subsets of the partition.

FUZZY MODEL-BASED LEARNING

The basic idea of the F-PS approach presented in the following is to learn an approximation of the unknown continuous Q-function Q^{µ*}, from which the optimal strategy can be easily derived (cf. eqn. 4).
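For concreteness, the following is a minimal sketch (Python; hypothetical helper, not part of the original paper) of one common way to realize such a fuzzy partition: normalized triangular membership functions over one dimension, with memberships of higher-dimensional states or actions obtained as products of the per-dimension memberships.

```python
# Sketch of a one-dimensional fuzzy partition with triangular membership
# functions, as assumed for the state and action spaces above.  The centers
# are hypothetical; any strictly increasing grid can be used.
import numpy as np

def triangular_partition(centers):
    """Return a function mapping a scalar x to the membership vector (mu_i(x))_i."""
    centers = np.asarray(centers, dtype=float)

    def memberships(x):
        mu = np.zeros(len(centers))
        if x <= centers[0]:
            mu[0] = 1.0
        elif x >= centers[-1]:
            mu[-1] = 1.0
        else:
            j = np.searchsorted(centers, x) - 1          # centers[j] <= x < centers[j+1]
            w = (x - centers[j]) / (centers[j + 1] - centers[j])
            mu[j], mu[j + 1] = 1.0 - w, w                # neighboring memberships sum to one
        return mu

    return memberships

mu_X = triangular_partition([0.0, 0.25, 0.5, 0.75, 1.0])  # five fuzzy sets on [0, 1]
print(mu_X(0.6))                                          # [0.  0.  0.6 0.4 0. ]
```

With a partition of this kind the memberships sum to one, so the Takagi-Sugeno approximation introduced next reduces to interpolation between the rule consequents, which is the connection to the grid-based scheme of Davies (1997) mentioned in the introduction.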
The Q-function will be approximated by the Takagi-Sugeno fuzzy system (Takagi and Sugeno, 1985; Sugeno, 1985)

    if x is X_i and a is A_u then Q^µ(x, a) = Q̂_iu + Σ_{l∈D_X} Q̂^{x_l}_iu (x_l − x̃_{i,l}) + Σ_{l∈D_A} Q̂^{a_l}_iu (a_l − ã_{u,l}),   i ∈ I, u ∈ U,    (5)
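Evaluating this rule base with the usual Takagi-Sugeno weighted-average inference, and selecting actions greedily as in (4), might look as follows. This is an illustrative sketch with hypothetical names (the paper does not prescribe an implementation); the continuous maximization in (4) is replaced here by a search over a finite set of candidate actions.

```python
# Sketch of evaluating the Takagi-Sugeno Q-function (5) and of greedy action
# selection (4).  All parameter arrays and membership functions are assumed given;
# x and a are NumPy vectors.
import numpy as np

def q_value(x, a, mu_X, mu_A, x_cent, a_cent, Q0, Qx, Qa):
    """
    mu_X, mu_A : functions returning the membership vectors of state x / action a
    x_cent     : (N_muX, N_X) centers x~_i;    a_cent: (N_muA, N_A) centers a~_u
    Q0         : (N_muX, N_muA) average Q-values Q^_iu
    Qx         : (N_muX, N_muA, N_X) derivatives Q^{x_l}_iu
    Qa         : (N_muX, N_muA, N_A) derivatives Q^{a_l}_iu
    """
    wx, wa = mu_X(x), mu_A(a)
    # consequent of rule (i, u):  Q^_iu + Qx_iu . (x - x~_i) + Qa_iu . (a - a~_u)
    local = (Q0
             + np.einsum('iul,il->iu', Qx, x - x_cent)
             + np.einsum('iul,ul->iu', Qa, a - a_cent))
    w = np.outer(wx, wa)                       # rule activations mu^X_i(x) * mu^A_u(a)
    return float((w * local).sum() / w.sum())  # weighted Takagi-Sugeno inference

def greedy_action(x, candidate_actions, model):
    """Eq. (4): return the candidate action with the largest approximated Q-value."""
    return max(candidate_actions, key=lambda a: q_value(x, a, **model))
```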

where Q̂_iu is an estimate of the average Q-value in (X_i, A_u), and Q̂^{x_l}_iu and Q̂^{a_l}_iu are estimates of the local partial derivatives ∂Q^µ/∂x_l (x̃_i, ã_u) and ∂Q^µ/∂a_l (x̃_i, ã_u), respectively. The estimation of the average Q-values and the average partial derivatives will be considered in the following subsections.

ESTIMATION OF AVERAGE Q-VALUES

Let N_{iu,k} be counters giving the number of executions of fuzzy action A_u in fuzzy state X_i until iteration k (i ∈ I, u ∈ U). Likewise, let M_{iuj,k} be counters giving the number of times that the execution of action A_u in state X_i caused a transition to X_j (i, j ∈ I, u ∈ U). On the observation of a transition (x_k, a_k, x_{k+1}), x_k ∈ X, x_{k+1} ∈ X, a_k ∈ A, with reward g_k ∈ R, these counters are increased according to the degrees of membership of the transition in the corresponding centers:

    N_{iu,k+1} := N_{iu,k} + µ^X_i(x_k) µ^A_u(a_k),   i ∈ I, u ∈ U,    (6)
    M_{iuj,k+1} := M_{iuj,k} + µ^X_i(x_k) µ^A_u(a_k) µ^X_j(x_{k+1}),   i ∈ I, u ∈ U, j ∈ I.    (7)

Based on these counters one can estimate the probability

    p_ij(u) := ∫∫∫ µ^X_i(x) µ^A_u(a) µ^X_j(y) p(y; x, a) dy da dx / ∫∫ µ^X_i(x) µ^A_u(a) da dx    (8)

that the execution of action A_u in state X_i causes a transition to state X_j:

    p̂_{ij,k+1}(u) := M_{iuj,k+1} / N_{iu,k+1}.    (9)

Let g_iuj be the average reward the agent may expect if it executes action A_u in state X_i and the action causes a transition to state X_j:

    g_iuj := ∫∫∫ µ^X_i(x) µ^A_u(a) µ^X_j(y) p(y; x, a) g(x, a, y) dy da dx / ∫∫∫ µ^X_i(x) µ^A_u(a) µ^X_j(y) p(y; x, a) dy da dx.    (10)

Then, an estimate ĝ_iuj of these average rewards can be gained by performing the update

    ĝ_{iuj,k+1} := ĝ_{iuj,k} + ( µ^X_i(x_k) µ^A_u(a_k) µ^X_j(x_{k+1}) / M_{iuj,k+1} ) [ g_k − ĝ_{iuj,k} ],   i ∈ I, u ∈ U, j ∈ I,    (11)

on the observation of transitions (x_k, a_k, x_{k+1}), x_k ∈ X, x_{k+1} ∈ X, a_k ∈ A, with rewards g_k ∈ R. Based on the discrete model (p̂_{ij,k+1}(u), ĝ_{iuj,k+1}) one can now calculate average Q-values. It can be shown that the solution of the fixed-point equation

    Q̂_{iu,k+1} = Σ_{j∈I} p̂_{ij,k+1}(u) [ ĝ_{iuj,k+1} + α max_{v∈U} Q̂_{jv,k+1} ]    (12)

gives estimates of the average Q-values

    Q_iu := ∫∫ µ^X_i(x) µ^A_u(a) Q^µ(x, a) da dx / ∫∫ µ^X_i(x) µ^A_u(a) da dx.    (13)

These estimates can be used in the representation (5) of the Q-function. The system (12) can be advantageously solved by discrete prioritized sweeping (Moore and Atkeson, 1993).
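The update rules (6), (7), (9) and (11) amount to maintaining fuzzy visit counts and a running reward average for every triple (X_i, A_u, X_j). A minimal sketch (Python; hypothetical class, assuming membership functions mu_X and mu_A that return NumPy membership vectors) is given below; the fixed-point system (12) would then be solved on top of this model, e.g. by standard discrete prioritized sweeping.

```python
# Sketch of the fuzzy model updates (6), (7), (9), (11) for one observed
# transition (x_k, a_k, x_{k+1}) with reward g_k.
import numpy as np

class FuzzyModel:
    def __init__(self, mu_X, mu_A, n_state_sets, n_action_sets):
        self.mu_X, self.mu_A = mu_X, mu_A
        self.N = np.zeros((n_state_sets, n_action_sets))                # N_iu
        self.M = np.zeros((n_state_sets, n_action_sets, n_state_sets))  # M_iuj
        self.g = np.zeros_like(self.M)                                  # g^_iuj

    def observe(self, x, a, x_next, reward):
        wx, wa, wy = self.mu_X(x), self.mu_A(a), self.mu_X(x_next)
        w = wx[:, None, None] * wa[None, :, None] * wy[None, None, :]
        self.N += np.outer(wx, wa)                                      # eq. (6)
        self.M += w                                                     # eq. (7)
        visited = self.M > 0
        # eq. (11): move g^_iuj toward the observed reward with stepsize w / M
        self.g[visited] += (w[visited] / self.M[visited]) * (reward - self.g[visited])

    def transition_probabilities(self):
        """Eq. (9): p^_ij(u) = M_iuj / N_iu (zero where nothing has been observed)."""
        N = self.N[:, :, None]
        return np.divide(self.M, N, out=np.zeros_like(self.M), where=N > 0)
```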

ESTIMATION OF AVERAGE PARTIAL DERIVATIVES

The partial derivatives Q̂^{x_l}_iu and Q̂^{a_l}_iu of the Q-function can be derived from average values and partial derivatives of the reward function and the transition probabilities. It can be shown that the following is satisfied for the partial derivatives with respect to the dimensions of the state space:

    Q̂^{x_l}_iu = ∂Q^µ/∂x_l (x̃_i, ã_u)
               = ∂/∂x_l ∫ p(y; x, ã_u) [ g(x, ã_u, y) + α max_{b∈A} Q^µ(y, b) ] dy |_{x = x̃_i}    (14)
               ≈ Σ_{j∈I} [ p^{x_l}_ij(u) ( g_iuj + α max_{v∈U} Q_jv ) + p_ij(u) g^{x_l}_iuj ],    (15)

where the average rewards g_iuj and transition probabilities p_ij(u) were defined in the preceding section and the average derivatives p^{x_l}_ij(u) and g^{x_l}_iuj are given by

    p^{x_l}_ij(u) := ∫∫∫ µ^X_i(x) µ^A_u(a) µ^X_j(y) (∂/∂x_l) p(y; x, a) dy da dx / ∫∫ µ^X_i(x) µ^A_u(a) da dx,    (16)

    g^{x_l}_iuj := ∫∫∫ µ^X_i(x) µ^A_u(a) µ^X_j(y) (∂/∂x_l) g(x, a, y) p(y; x, a) dy da dx / ∫∫∫ µ^X_i(x) µ^A_u(a) µ^X_j(y) p(y; x, a) dy da dx.    (17)

Likewise, the partial derivatives with respect to the dimensions of the action space can be approximated as follows:

    Q̂^{a_l}_iu = ∂Q^µ/∂a_l (x̃_i, ã_u) ≈ Σ_{j∈I} [ p^{a_l}_ij(u) ( g_iuj + α max_{v∈U} Q_jv ) + p_ij(u) g^{a_l}_iuj ],    (18)

where the abbreviations

    p^{a_l}_ij(u) := ∫∫∫ µ^X_i(x) µ^A_u(a) µ^X_j(y) (∂/∂a_l) p(y; x, a) dy da dx / ∫∫ µ^X_i(x) µ^A_u(a) da dx,    (19)

    g^{a_l}_iuj := ∫∫∫ µ^X_i(x) µ^A_u(a) µ^X_j(y) (∂/∂a_l) g(x, a, y) p(y; x, a) dy da dx / ∫∫∫ µ^X_i(x) µ^A_u(a) µ^X_j(y) p(y; x, a) dy da dx    (20)

were introduced. In the following subsections, it will be shown how the average partial derivatives of the reward function and the conditional probability density function can be estimated from observed transitions. Then, the partial derivatives of the Q-function can be estimated using the approximations (15) and (18).
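Once the averaged model quantities and their derivatives are available (their estimation is the subject of the following subsections), assembling the derivative estimates according to (15) and (18) is a sum over the successor sets X_j. A sketch under these assumptions (Python, hypothetical names) is:

```python
# Sketch of the approximations (15) and (18): partial derivatives of the
# Q-function assembled from the estimated average model quantities.
import numpy as np

def q_derivatives(p, dp_x, dp_a, g, dg_x, dg_a, Q_avg, alpha):
    """
    p    : (I, U, J)       transition probabilities p_ij(u)
    dp_x : (I, U, J, N_X)  derivatives p^{x_l}_ij(u);   dp_a: (I, U, J, N_A)
    g    : (I, U, J)       average rewards g_iuj
    dg_x : (I, U, J, N_X)  derivatives g^{x_l}_iuj;     dg_a: (I, U, J, N_A)
    Q_avg: (J, U)          average Q-values Q_jv from the fixed point (12)
    """
    # g_iuj + alpha * max_v Q_jv, broadcast over (i, u, j)
    target = g + alpha * Q_avg.max(axis=1)[None, None, :]
    Qx = np.einsum('iujl,iuj->iul', dp_x, target) + np.einsum('iuj,iujl->iul', p, dg_x)
    Qa = np.einsum('iujl,iuj->iul', dp_a, target) + np.einsum('iuj,iujl->iul', p, dg_a)
    return Qx, Qa   # estimates of Q^{x_l}_iu and Q^{a_l}_iu for the consequents in (5)
```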

Partial Derivatives of the Reward Function

The average local reward g_iuj and the average local derivatives g^{x_l}_iuj and g^{a_l}_iuj of the reward function g can be estimated by adapting the parameters ĝ_iuj, ĝ^{x_l}_iuj and ĝ^{a_l}_iuj of the following linear function to experiences in the vicinity of the center (x̃_i, ã_u, x̃_j):

    ǧ(x, a, y) := ĝ_iuj + Σ_{l∈D_X} ĝ^{x_l}_iuj (x_l − x̃_{i,l}) + Σ_{l∈D_A} ĝ^{a_l}_iuj (a_l − ã_{u,l}) + Σ_{l∈D_X} ĝ^{y_l}_iuj (y_l − x̃_{j,l}).    (21)

On the observation of a transition (x_k, a_k, x_{k+1}) with reward g_k, the parameters can be adapted by performing a gradient descent with respect to the following error measure:

    E := (1/2) ( g_k − ǧ(x_k, a_k, x_{k+1}) )².    (22)

Let

    η_{iuj,k} := µ^X_i(x_k) µ^A_u(a_k) µ^X_j(x_{k+1}) / M_{iuj,k+1}    (23)

be the stepsizes for the gradient descent, such that the stepsize for a given center is weighted by the membership of observed transitions in this center and decreases gradually. Based on (22) and (23), the following update rules can be derived (i, j ∈ I, u ∈ U):

    ĝ_{iuj,k+1} = ĝ_{iuj,k} + η_{iuj,k} ( g_k − ǧ(x_k, a_k, x_{k+1}) ),    (24)
    ĝ^{x_l}_{iuj,k+1} = ĝ^{x_l}_{iuj,k} + η_{iuj,k} ( x_{k,l} − x̃_{i,l} ) ( g_k − ǧ(x_k, a_k, x_{k+1}) ),   l ∈ D_X,    (25)
    ĝ^{a_l}_{iuj,k+1} = ĝ^{a_l}_{iuj,k} + η_{iuj,k} ( a_{k,l} − ã_{u,l} ) ( g_k − ǧ(x_k, a_k, x_{k+1}) ),   l ∈ D_A,    (26)
    ĝ^{y_l}_{iuj,k+1} = ĝ^{y_l}_{iuj,k} + η_{iuj,k} ( x_{k+1,l} − x̃_{j,l} ) ( g_k − ǧ(x_k, a_k, x_{k+1}) ),   l ∈ D_X.    (27)

Note that an alternative update rule for ĝ_iuj was already given in (11).

Partial Derivatives of the Conditional Probability Density Function

The average partial derivatives of the conditional probability density function can be approximated as follows:

    p^{x_l}_ij(u) ≈ ∫∫∫ µ^X_i(x) µ^A_u(a) µ^X_j(y) [ p(y; x + ε e^{N_X}_l, a) − p(y; x − ε e^{N_X}_l, a) ] / (2ε) dy da dx / ∫∫ µ^X_i(x) µ^A_u(a) da dx,    (28)

    p^{a_l}_ij(u) ≈ ∫∫∫ µ^X_i(x) µ^A_u(a) µ^X_j(y) [ p(y; x, a + ε e^{N_A}_l) − p(y; x, a − ε e^{N_A}_l) ] / (2ε) dy da dx / ∫∫ µ^X_i(x) µ^A_u(a) da dx,    (29)

where e^d_l is a vector of dimension d with components e^d_{l,i} = δ_il, i = 1, ..., d, δ is the Kronecker symbol and ε is a small constant. Let L^{x_l,+}_iu count the number of executions of action A_u in a fuzzy state that results from shifting state X_i along dimension l by +ε, and let M^{x_l,+}_iuj count the number of times that action A_u caused a transition from this state to X_j. Likewise, let L^{x_l,−}_iu be a counter for the number of executions of action A_u in a state that results from shifting state X_i along dimension l by −ε, and let M^{x_l,−}_iuj count the number of times that A_u caused a transition from this state to X_j. On the observation of a transition (x_k, a_k, x_{k+1}, g_k), these counters can be updated as follows (i ∈ I, u ∈ U):

    L^{x_l,+}_{iu,k+1} := L^{x_l,+}_{iu,k} + µ^X_i(x_k − ε e^{N_X}_l) µ^A_u(a_k),    (30)
    M^{x_l,+}_{iuj,k+1} := M^{x_l,+}_{iuj,k} + µ^X_i(x_k − ε e^{N_X}_l) µ^A_u(a_k) µ^X_j(x_{k+1}),   j ∈ I,    (31)
    L^{x_l,−}_{iu,k+1} := L^{x_l,−}_{iu,k} + µ^X_i(x_k + ε e^{N_X}_l) µ^A_u(a_k),    (32)
    M^{x_l,−}_{iuj,k+1} := M^{x_l,−}_{iuj,k} + µ^X_i(x_k + ε e^{N_X}_l) µ^A_u(a_k) µ^X_j(x_{k+1}),   j ∈ I.    (33)

In a similar way, counters L^{a_l,+}_iu, M^{a_l,+}_iuj, L^{a_l,−}_iu and M^{a_l,−}_iuj with the following update rules can be defined (i ∈ I, u ∈ U):

    L^{a_l,+}_{iu,k+1} := L^{a_l,+}_{iu,k} + µ^X_i(x_k) µ^A_u(a_k − ε e^{N_A}_l),    (34)
    M^{a_l,+}_{iuj,k+1} := M^{a_l,+}_{iuj,k} + µ^X_i(x_k) µ^A_u(a_k − ε e^{N_A}_l) µ^X_j(x_{k+1}),   j ∈ I,    (35)
    L^{a_l,−}_{iu,k+1} := L^{a_l,−}_{iu,k} + µ^X_i(x_k) µ^A_u(a_k + ε e^{N_A}_l),    (36)
    M^{a_l,−}_{iuj,k+1} := M^{a_l,−}_{iuj,k} + µ^X_i(x_k) µ^A_u(a_k + ε e^{N_A}_l) µ^X_j(x_{k+1}),   j ∈ I.    (37)

Then, the average partial derivatives (28) and (29) can be estimated as follows (i, j ∈ I, u ∈ U):

    p̂^{x_l}_{ij,k+1}(u) := (1/(2ε)) [ M^{x_l,+}_{iuj,k+1} / L^{x_l,+}_{iu,k+1} − M^{x_l,−}_{iuj,k+1} / L^{x_l,−}_{iu,k+1} ],    (38)
    p̂^{a_l}_{ij,k+1}(u) := (1/(2ε)) [ M^{a_l,+}_{iuj,k+1} / L^{a_l,+}_{iu,k+1} − M^{a_l,−}_{iuj,k+1} / L^{a_l,−}_{iu,k+1} ].    (39)
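The counters (30)-(37) are the same fuzzy counts as in (6)-(7), accumulated once for the partition shifted by +ε and once for the partition shifted by −ε along the dimension of interest; (38)-(39) then form a central difference of the two resulting transition models. A sketch for the state derivative along one dimension (Python, hypothetical class; one instance would be kept per dimension l) is:

```python
# Sketch of the finite-difference estimator (30)-(33), (38) for p^{x_l}_ij(u)
# along a single state dimension l; the action derivatives (34)-(37), (39)
# are obtained analogously by shifting the action instead of the state.
import numpy as np

class TransitionDerivativeEstimator:
    def __init__(self, mu_X, mu_A, n_state_sets, n_action_sets, dim_l, eps=0.05):
        self.mu_X, self.mu_A, self.dim_l, self.eps = mu_X, mu_A, dim_l, eps
        self.L_plus = np.zeros((n_state_sets, n_action_sets))                  # eq. (30)
        self.L_minus = np.zeros_like(self.L_plus)                              # eq. (32)
        self.M_plus = np.zeros((n_state_sets, n_action_sets, n_state_sets))    # eq. (31)
        self.M_minus = np.zeros_like(self.M_plus)                              # eq. (33)

    def observe(self, x, a, x_next):
        x = np.asarray(x, dtype=float)
        shift = np.zeros_like(x)
        shift[self.dim_l] = self.eps
        wa, wy = self.mu_A(a), self.mu_X(x_next)
        w_plus, w_minus = self.mu_X(x - shift), self.mu_X(x + shift)   # shifted partitions
        self.L_plus += np.outer(w_plus, wa)
        self.M_plus += w_plus[:, None, None] * wa[None, :, None] * wy[None, None, :]
        self.L_minus += np.outer(w_minus, wa)
        self.M_minus += w_minus[:, None, None] * wa[None, :, None] * wy[None, None, :]

    def estimate(self):
        """Eq. (38): central difference of the two shifted transition models."""
        def ratio(M, L):
            L = L[:, :, None]
            return np.divide(M, L, out=np.zeros_like(M), where=L > 0)
        return (ratio(self.M_plus, self.L_plus)
                - ratio(self.M_minus, self.L_minus)) / (2 * self.eps)
```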

OPTIMAL SELECTION OF FRAMEWORK SIGNAL PLANS

[Figure 1: Example framework signal plan (left: request, cycle time and extension intervals) and test scenario (right: network with residential area, shopping center north, shopping center south with cinema, industrial area, and intersections A-D).]

Framework signal plans define constraints on signal control strategies in traffic networks. A framework signal plan usually comprises individual signal plans for all traffic signals controlled by the framework signal plan. In the left part of figure 1 an example signal plan is depicted. Green phases of the traffic signal controlled according to this signal plan have to start within the "request" interval and have to end within the "extension" interval. Within the leeway given by signal plans, traffic-dependent optimization may be performed or public transportation may be prioritized.

Sophisticated traffic control systems are able to choose between different framework signal plans in dependence of traffic conditions. The rules controlling this selection are usually tuned by hand, which is not trivial in complex traffic networks. The task of selecting framework signal plans in dependence of traffic conditions, however, can be considered as a Markov decision problem, where the state is composed of measurements made on the traffic network and the framework signal plans are the available actions.

In the following, the scenario shown in the right part of figure 1 will be considered. The traffic density is measured at the three points indicated by arrows. It is assumed that three framework signal plans are given. Plan 1 favors horizontal traffic streams and should therefore be used in the morning when people go to work. In Plan 2, horizontal and vertical phases have the same length, such that this plan is suitable at noon and in the afternoon when people go shopping and return from work. The third plan finally favors traffic flows between the residential area and the cinema and should therefore be selected in the evening. During learning the controller gets the following rewards:

    g := − Σ_l ( ρ_l / ρ_{l,max} )²,    (40)

where ρ_l and ρ_{l,max} give the average and maximum density, respectively, of vehicles in link l. The basic idea behind this definition is that the average density in the road network is to be minimized, where homogeneous states in which all roads have a similar density result in larger rewards than inhomogeneous states.
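As a small illustration (not from the paper; the sign follows the reconstruction of (40) above), the reward can be computed directly from the measured link densities:

```python
# Sketch of the reward (40): the controller is penalized by the sum of squared
# relative link densities, so for a given total density a homogeneous
# distribution over the links yields the larger (less negative) reward.
def traffic_reward(densities, max_densities):
    """densities / max_densities: average and maximum vehicle density per link."""
    return -sum((rho / rho_max) ** 2 for rho, rho_max in zip(densities, max_densities))

# e.g. three links at 30%, 50% and 80% of their maximum density
print(traffic_reward([0.3, 0.5, 0.8], [1.0, 1.0, 1.0]))   # -0.98
```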

[Figure 2: Partitions of the sensor signals (normalized density ρ/ρ_max, fuzzy sets is_vs, is_s, is_m, is_h, is_vh) for the PS approach (left, crisp) and the F-PS approach (right, fuzzy).]

Two algorithms were applied to this Markov decision problem: training with prioritized sweeping (Moore and Atkeson, 1993), where the state space was discretized by the crisp partition shown in the left part of figure 2 (PS), and training with the fuzzy prioritized sweeping approach proposed in this article (F-PS), where the fuzzy partition shown in the right part of figure 2 was used. The progress of these algorithms is shown in figure 3. For the plot, training was interrupted every two simulated days and the strategy learned until then was applied to the network for one further simulated day. The total rewards gained in the course of these evaluation days are shown in figure 3, where averages over 10 runs are plotted in order to reduce statistical effects. The learning task, obviously, is solved much faster by the fuzzy model-based approach than by the crisp approach. Moreover, the strategy learned by F-PS is superior to the strategy learned by PS, i.e. the continuous Q-function obviously cannot be approximated sufficiently well by an architecture based on the crisp partition shown in figure 2.

[Figure 3: Progress of framework signal plan selection with prioritized sweeping (PS) and fuzzy prioritized sweeping (F-PS): total average density per day over the number of simulated days.]

CONCLUSIONS

In this article a novel fuzzy model-based reinforcement learning approach was presented. The approach represents continuous Q-functions by Takagi-Sugeno models with linear consequents. As Q-functions directly represent control knowledge, control strategies learned by the F-PS approach can be expected to be superior to strategies learned by methods based on crisp partitions. The proposed method was applied to the task of selecting optimal framework signal plans in dependence of traffic conditions. As expected, the proposed method outperforms the crisp PS approach when used with partitions of similar granularity. In the example of application presented in this article, the actions were discrete. The proposed algorithm, however, also performs well in environments with continuous action spaces, as can easily be verified on small toy examples. Real-world problems with continuous action spaces will be considered in future publications.

REFERENCES

Appl, M.; Palm, R., 1999, Fuzzy Q-learning in nonstationary environments, Proceedings of the 7th European Congress on Intelligent Techniques and Soft Computing.

Bertsekas, D. P.; Tsitsiklis, J. N., 1996, Neuro-Dynamic Programming, Athena Scientific.

Bingham, E., 1998, Neurofuzzy traffic signal control, Master's thesis, Helsinki University of Technology.

Davies, S., 1997, Multidimensional triangulation and interpolation for reinforcement learning, Advances in Neural Information Processing Systems, Volume 9, The MIT Press.

Moore, A. W.; Atkeson, C. G., 1993, Memory-based reinforcement learning: Converging with less data and less time, Robot Learning.

Sugeno, M., 1985, An introductory survey of fuzzy control, Information Sciences 36.

Sutton, R. S.; Barto, A. G., 1998, Reinforcement Learning: An Introduction, The MIT Press.

Takagi, T.; Sugeno, M., 1985, Fuzzy identification of systems and its application to modeling and control, IEEE Transactions on Systems, Man and Cybernetics, Volume 15.

Thorpe, T., 1997, Vehicle Traffic Light Control Using SARSA, Ph.D. thesis, Department of Computer Science, Colorado State University.

Watkins, C. J. C. H., 1989, Learning from Delayed Rewards, Ph.D. thesis, Cambridge University.
