Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm


Michail G. Lagoudakis
Department of Computer Science
Duke University
Durham, NC 27708
mgl@cs.duke.edu

Abstract

We consider the problem of controlling and balancing a freely-swinging pendulum on a moving cart. We assume that no model of the nonlinear system is available. We model the problem as a Markov Decision Process and draw on techniques from the field of Reinforcement Learning to implement a learning controller. Although slow, the learning process is demonstrated (in simulation) on two challenging control tasks.

Introduction

Nonlinear control has become a growing field of study. Although the available techniques and methods still lag far behind the mature field of linear control, there is much interest in developing nonlinear control methods that go beyond linearization. This is probably due to the fact that most phenomena in the real world are inherently nonlinear, and linear methods cannot fully describe them. At the same time, there is always a need for flexible and adaptive controllers that can operate over a wide range of conditions and under uncertainty, which cannot be captured by linear controllers. To this end, a whole field, that of learning systems, has developed over the last decades. Adaptive control, artificial neural networks, fuzzy logic, machine learning, intelligent control, and learning robotics can all be placed under this umbrella. In this paper, we focus on one such trend that goes by the name of reinforcement learning.

We propose two control tasks on a nonlinear system and assume that no model of the system is given. We then develop learning controllers that achieve the tasks by trial and error. Our experience is that although this process can be time-consuming, it can be very successful. We begin by providing the necessary background for this work, Markov Decision Processes and Reinforcement Learning, looking more closely at the components used in this paper, such as the Q-Learning algorithm. We then present the nonlinear system, which consists of a pendulum on a moving cart, as well as the model that simulates its dynamics. Subsequently, we define the two control tasks and provide the details of setting them up as learning problems. The experimental results demonstrate that our learning system completes both tasks successfully in reasonable time and with reasonable performance.

Markov Decision Processes

A Markov Decision Process (MDP) consists of: a finite set of states S; a finite set of actions A; a state transition function T : S × A → Π(S), where Π(S) denotes the probability distributions over S and T(s, a, s') is the probability of making the transition from s to s' under action a; and a reward function R : S × A × S → ℝ, where R(s, a, s') is the (expected) reward for making that transition.

Many real-world problems (especially in the Operations Research area) can be formulated under this framework. The distinguishing characteristic is the so-called Markov property, namely the property that state transitions are independent of any previous states and/or actions. We can use a transition graph to visualize an MDP. Figure 1 shows such a graph for an MDP with 2 states (battery level: LOW and HIGH) and 3 actions (robot: Wait, Search, Recharge). Transition probabilities and rewards are shown on the edges.
The task here refers to a recycling robot that can choose among searching for an empty can, waiting, or recharging its battery, depending on the battery level.

Figure 1: The Recycling Robot (Sutton and Barto, 1998). Edges are labeled with transition probabilities (1, α, 1 − α, β, 1 − β) and rewards (R_search, R_wait, −3).
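To make the formalism concrete, the recycling-robot MDP of Figure 1 can be written down as a small lookup structure. The following is a minimal sketch in Python (the implementation described later in the paper is in MATLAB); the numeric values chosen for α, β and the rewards are illustrative placeholders, not values from the paper.

```python
# A sketch of the recycling-robot MDP of Figure 1 as plain Python data
# structures. The quantities alpha, beta, R_search and R_wait are symbolic
# in the figure, so the numbers below are placeholders only.
ALPHA, BETA = 0.9, 0.4          # placeholder transition probabilities
R_SEARCH, R_WAIT = 2.0, 1.0     # placeholder rewards

STATES = ["high", "low"]
ACTIONS = ["search", "wait", "recharge"]

# T[s][a] lists (next_state, probability, reward) triples, a tabular encoding
# of T(s, a, s') and R(s, a, s'). Following Sutton and Barto's example,
# searching on a low battery risks a rescue that costs a reward of -3.
T = {
    "high": {
        "search": [("high", ALPHA, R_SEARCH), ("low", 1 - ALPHA, R_SEARCH)],
        "wait":   [("high", 1.0, R_WAIT)],
    },
    "low": {
        "search":   [("low", BETA, R_SEARCH), ("high", 1 - BETA, -3.0)],
        "wait":     [("low", 1.0, R_WAIT)],
        "recharge": [("high", 1.0, 0.0)],
    },
}

# Sanity check: outgoing probabilities for every (state, action) sum to one.
for s in STATES:
    for a, outcomes in T[s].items():
        assert abs(sum(p for _, p, _ in outcomes) - 1.0) < 1e-9
```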

A deterministic policy for an MDP is a mapping π : S → A. A stochastic policy is a mapping π : S → Π(A), where Π(A) denotes the probability distributions over A. Stochastic policies are also called soft, because they do not commit to a single action per state. We write π(s, a) for the probability that policy π chooses action a in state s. An example is the ε-soft policy, for some 0 < ε < 1, which picks a particular action with probability 1 − ε and picks an action at random with probability ε. Finally, an optimal policy π* for an MDP is a policy that maximizes the expected total reward over time (for minimization problems there is a similar definition). Formally,

$\pi^* = \arg\max_\pi E_\pi(R) = \arg\max_\pi E_\pi\left[ \sum_{t=0}^{h} \gamma^t R(t) \right]$

where h determines the horizon (how long the process runs) and γ is the discount rate (which relates to the present value of future rewards). For episodic tasks the horizon is finite, h < ∞, and γ ≤ 1. For continuing tasks h = ∞ and 0 < γ < 1.

Value Functions

Given an MDP and a policy, the state value function assigns a value to each state. The value V^π(s) of a state s under a policy π is the expected return when starting in state s and following π thereafter. Formally,

$V^\pi(s) = E_\pi(R_t \mid s_t = s)$

Similarly, the state-action value function assigns a value to each (state, action) pair. The value Q^π(s, a) of taking action a in state s under a policy π is the expected return starting from s, taking a, and following π thereafter:

$Q^\pi(s, a) = E_\pi(R_t \mid s_t = s, a_t = a)$

Notice that the state and state-action value functions are related as follows:

$V^\pi(s) = \sum_{a \in A} \pi(s, a)\, Q^\pi(s, a)$

$Q^\pi(s, a) = \sum_{s' \in S} T(s, a, s') \left[ R(s, a, s') + \gamma V^\pi(s') \right]$

By substituting these expressions into each other we obtain the following recursive equations for the value functions, known as the Bellman equations:

$V^\pi(s) = \sum_{a \in A} \pi(s, a) \sum_{s' \in S} T(s, a, s') \left[ R(s, a, s') + \gamma V^\pi(s') \right]$

$Q^\pi(s, a) = \sum_{s' \in S} T(s, a, s') \left[ R(s, a, s') + \gamma \sum_{a' \in A} \pi(s', a')\, Q^\pi(s', a') \right]$

It is easy to see that these equations are linear in V^π(s) or Q^π(s, a), and the exact values can be obtained by solving the resulting system of linear equations. The optimal value function is the one that corresponds to the optimal policy. In this case, the value of each state or state-action pair is maximized:

$V^*(s) = \max_\pi V^\pi(s) \qquad Q^*(s, a) = \max_\pi Q^\pi(s, a)$

It has been proven that for any MDP there exists a deterministic optimal policy. Given that, we can derive an expression for the optimal value function from the equations above by always selecting (with probability 1) the action that maximizes the value. This results in the Bellman optimality equations:

$V^*(s) = \max_{a \in A} \sum_{s' \in S} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$

$Q^*(s, a) = \sum_{s' \in S} T(s, a, s') \left[ R(s, a, s') + \gamma \max_{a' \in A} Q^*(s', a') \right]$

Unfortunately, these are nonlinear equations and cannot be solved analytically. There is, however, a dynamic programming algorithm, known as value iteration, that iteratively approximates the optimal value function. Why is the optimal value function so important? Simply because, given the optimal value function, we can easily derive the optimal policy for the MDP. It is known that the optimal policy is deterministic and greedy with respect to the optimal value function: it picks only one action per state, namely the one that locally maximizes the value function. Formally, in state s, pick the action

$a^* = \arg\max_{a \in A} \sum_{s' \in S} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right] = \arg\max_{a \in A} Q^*(s, a)$
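Both observations above, that the Bellman equations for a fixed policy are linear and can be solved exactly, and that the optimal value function can be approximated by value iteration and then turned into a greedy (optimal) policy, are easy to state in code. The sketch below uses dense NumPy arrays for T and R and is purely illustrative; it is not code from the paper, whose implementation is in MATLAB.

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma):
    """Exact policy evaluation: solve the linear Bellman equations for V^pi.

    P[s, a, t] = T(s, a, s'), R[s, a, t] = R(s, a, s'), pi[s, a] = pi(s, a).
    """
    n = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)        # state-to-state matrix under pi
    r_pi = np.einsum("sa,sat,sat->s", pi, P, R)  # expected one-step reward under pi
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate the Bellman optimality backup; each sweep costs O(|A||S|^2).

    Returns the (approximate) optimal values V* and a greedy policy,
    which, as argued above, is an optimal policy.
    """
    V = np.zeros(P.shape[0])
    while True:
        # Q[s, a] = sum_s' T(s,a,s') [R(s,a,s') + gamma V(s')]
        Q = np.einsum("sat,sat->sa", P, R) + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# Tiny usage example on a random 3-state, 2-action MDP with a uniform policy.
rng = np.random.default_rng(0)
P = rng.random((3, 2, 3)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((3, 2, 3))
print(evaluate_policy(P, R, np.full((3, 2), 0.5), gamma=0.9))
print(value_iteration(P, R, gamma=0.9))
```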
The beauty of the value function is that it turns a difficult global optimization problem over the long term (finding the optimal policy) into a simple local search problem. Notice that given V*(s) we still need the environment dynamics (T and R) in order to determine the optimal policy, whereas given Q*(s, a) we need no additional information. This is a crucial feature that will be exploited later for model-free learning. However, it comes at the expense of a more complicated (two-dimensional) function.

Solving MDPs

There are a number of methods for solving an MDP, i.e., for determining the optimal policy. We briefly mention four of them.

Value Iteration. Here we try to find (or rather, approximate) the optimal value function by solving the Bellman optimality equations; the optimal policy can then easily be found. The idea is to start with an arbitrary value function and iteratively update the values using the Bellman optimality equations until there is no change. Eventually, the value function converges to the optimal one. The cost is O(|A||S|^2) per iteration, but the number of iterations can grow exponentially in the discount factor.

Policy Iteration. In this case, we manipulate the policy directly. Start with an arbitrary policy, evaluate it (solve the linear Bellman equations), improve it (make it greedy with respect to its value function), and repeat until the policy no longer changes (it converges to the optimal one). The cost in this case is O(|A||S|^2 + |S|^3) per iteration, and the number of iterations is bounded by |A|^|S|, the total number of distinct deterministic policies, although there is no proof that it can actually be that large.

Modified Policy Iteration. This method combines the previous two. Instead of solving the Bellman equations exactly to evaluate an intermediate policy (O(|S|^3)), one can run a few steps of value iteration (O(|A||S|^2) each) to approximate it, then improve the policy and repeat. It turns out that the method converges to the optimal policy and is very efficient in practice.

Linear Programming. Finally, MDPs can be recast as linear programming problems. This is the only known polynomial-time algorithm, although it is not always efficient in practice.

Reinforcement Learning

So far we have assumed that a model is given and the dynamics are known. Uncertainty, however, is inherent in the real world. Sensors are limited and/or inaccurate, measurements cannot be taken at all times, many systems are unpredictable, and so on. For many interesting problems, the transition probabilities T(s, a, s') or the reward function R(s, a, s') are simply unavailable. Can we still make good decisions in such a case? On what basis?

Let us begin with what is available. We assume that we have complete observability of the state, that is, we know which state we are in with certainty. (Cases with hidden state information are formulated as Partially Observable Markov Decision Processes (POMDPs); we do not discuss POMDPs here, but we do consider a case with hidden state later.) Moreover, we can sample the unknown functions T(s, a, s') and R(s, a, s') through interaction with the environment. Based on that experience, i.e., reinforcements from the environment, can we learn to act optimally in the world? This question leads us to the field of Reinforcement Learning (RL), which has recently received significant attention. It is better to think of RL as a class of problems rather than a class of methods: exactly those problems in which a complete, interactive, goal-seeking agent learns to act in its environment by trial and error. There are several issues associated with RL problems. We focus on three of them:

Delayed rewards. In many problems, reward is received only after long periods. Consider chess or any similar game: the agent cannot be rewarded before the end of the game, because only then is the outcome known.

Credit assignment. This problem has two aspects. If the agent receives a large reward (or punishment), which of the most recent (sequential) decisions was responsible for it (temporal credit assignment)? And which part of the system was responsible for it (structural credit assignment)?

Exploration and exploitation. Since the agent is not given information in advance, it has to learn by exploration. Exploration yields knowledge that can then be exploited to gain more and more reward, but there might be even better ways to get rewarded. Exploration and exploitation therefore pull in different directions. How can the agent balance the trade-off between them? A simple and common recipe is sketched below.
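A standard way to balance this trade-off is the ε-soft rule already mentioned in the discussion of stochastic policies: act greedily with respect to the current value estimates most of the time, and pick a random action with a small probability ε. A minimal illustrative helper follows (the ε values actually used in the experiments are given in the Implementation section).

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick an action index from one row Q(s, .) of a tabular Q-function:
    uniformly random with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit

# Example: mostly picks action 1, occasionally explores.
rng = np.random.default_rng(0)
actions = [epsilon_greedy(np.array([0.1, 0.5, 0.2]), 0.2, rng) for _ in range(10)]
```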
Given the situation that an RL agent faces, there are two general ways to go:

Model-based approaches. Learn a model of the process and use the model to derive an optimal policy. This corresponds to so-called indirect control.

Model-free approaches. Learn a policy without learning a model. This corresponds to so-called direct control.

In this work we focus on model-free methods for RL problems. Such methods mostly proceed by trying to estimate (learn) the optimal state-action value function of the MDP. If this is possible, we can derive the optimal policy without a model. Two broad classes of such methods are the following:

Monte-Carlo (MC) methods. Monte-Carlo methods estimate the values by sampling total rewards over long runs starting from particular state-action pairs. Given enough data, this leads to a very good approximation. By using exploring starts or a stochastic policy, the agent maintains exploration during learning.

Temporal Difference (TD) methods. In this case, the agent estimates the values based on sample rewards and its previous estimates. Thus, TD methods are fully incremental and learn on a step-by-step basis. Moreover, the algorithms known as TD(λ) provide a way to add some MC flavor to a TD algorithm by propagating recent information along the most recent path in the state space using so-called eligibility traces.

In both cases, it is possible to learn the value function of one policy while following another. For example, we can learn the optimal value function while following an exploratory policy. This is called off-policy learning. When the two policies match, we talk about on-policy learning.

Q-Learning: An Off-Policy TD Method

Q-learning (Watkins, 1989) is one of the best-known and most important RL algorithms. Q-learning attempts to learn the Q values of the state-action value function, hence the name. The idea is as follows. You are currently in state s_t. Take action a_t under the current policy. Observe the new state s_{t+1} and the immediate reward r_{t+1}. The previous estimate for (s_t, a_t) was Q(s_t, a_t). The experienced estimate for (s_t, a_t) is r_{t+1} + γ max_a Q(s_{t+1}, a). The new estimate for (s_t, a_t) is a convex combination of the two:

$Q^{(t+1)}(s_t, a_t) = (1 - \alpha)\, Q^{(t)}(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q^{(t)}(s_{t+1}, a) \right]$

where α (0 < α < 1) is the learning rate.
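The update rule above can be turned into a complete learning loop in a few lines. The sketch below is a plain tabular Q-learning episode with ε-greedy exploration, written in Python; the paper's actual controller is the Q(λ) variant (with eligibility traces) implemented in MATLAB, and the env object with its reset()/step() interface is an assumption made here for illustration.

```python
import numpy as np

def q_learning_episode(env, Q, alpha, gamma, epsilon, rng):
    """One episode of tabular Q-learning (off-policy TD control).

    Q is an |S| x |A| array of estimates. env is assumed to expose
    reset() -> state and step(action) -> (next_state, reward, done),
    with integer-coded states and actions.
    """
    s = env.reset()
    done = False
    while not done:
        # Behaviour policy: epsilon-greedy on the current estimates.
        if rng.random() < epsilon:
            a = int(rng.integers(Q.shape[1]))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)
        # Target policy: greedy on Q (hence off-policy learning).
        target = r if done else r + gamma * np.max(Q[s_next])
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
        s = s_next
    return Q
```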

Generalization and Function Approximation

Finally, it should be mentioned that there are different ways to represent the value functions, and the success or failure of a value-function-based RL method can depend on this choice. The naive way is to use a (possibly huge) table to store the values Q(s, a). The advantage is that a table can represent any function and convergence is guaranteed. The downside is slow learning; in addition, the curse of dimensionality (exponential growth of the state space) and the restriction to discrete spaces can be severe handicaps of the tabular representation. Another way is to use a function approximator to represent Q(s, a). The intuition is that similar entries (s, a) should have similar values Q(s, a). Depending on the definition of similarity and the generalization capabilities of the function approximator used, this representation can be a big win. Choices for a function approximator include, but are not limited to, state aggregation, polynomial regression, and neural networks. The advantages in this case are rapid adaptation, the handling of continuous spaces, and, of course, the ability to generalize. The main drawback is that convergence is no longer guaranteed.

The Pendulum on the Cart

Figure 2 depicts the system we study in this paper. A moving cart carries a pendulum that can swing freely around its pivot. Assume that the system is built in such a way that the pendulum can complete a full circle.

Figure 2: The Pendulum on the Cart. The pendulum angle is denoted ϑ.

This system is described by the following state equations:

$\dot{x}_1 = x_2$

$\dot{x}_2 = \frac{g \sin(x_1) - a m l x_2^2 \sin(2 x_1)/2 - a \cos(x_1)\, u}{4l/3 - a m l \cos^2(x_1)}$

where the variables and parameters are as follows:

x_1 : the angle; x_2 : the angular velocity;
g : the gravity constant (g = 9.8 m/s^2);
m : the mass of the pendulum (m = 2.0 kg);
M : the mass of the cart (M = 8.0 kg);
l : the length of the pendulum;
u : the force applied to the cart (in Newtons);
a : a = 1/(m + M).

It is easy to see that this is a nonlinear system, and controlling it to achieve a certain behavior or reach a certain state is not a trivial task. For our purposes, we ignore the model of the system: if the model were known, there would be no need to apply learning, since several powerful techniques exist (see, for example, (Wang et al., 1996)) that can solve the problem optimally. In this work, therefore, the model serves only as the simulator of the system dynamics, given the lack of a real cart with a pendulum.
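The state equations above are straightforward to simulate. The following is a Python re-sketch of such a simulator (the model used in the paper is Prof. Wang's MATLAB code); the numeric value of the pendulum length l, the integration scheme, and the step sizes are assumptions made here for illustration, not values taken from the paper.

```python
import numpy as np

# Parameters from the text; the pendulum length L below is an assumed value,
# since it is needed by the dynamics but not listed with a number above.
G, M_PEND, M_CART, L = 9.8, 2.0, 8.0, 0.5
A = 1.0 / (M_PEND + M_CART)

def pendulum_derivatives(x, u):
    """Time derivatives of (angle x1, angular velocity x2) for applied force u."""
    x1, x2 = x
    num = G * np.sin(x1) - A * M_PEND * L * x2 ** 2 * np.sin(2 * x1) / 2.0 \
          - A * np.cos(x1) * u
    den = 4.0 * L / 3.0 - A * M_PEND * L * np.cos(x1) ** 2
    return np.array([x2, num / den])

def simulate_step(x, u, dt=0.1, n_substeps=10):
    """Advance the state by one control interval dt with the force u held
    constant, using simple Euler sub-steps (an assumed integration scheme)."""
    h = dt / n_substeps
    for _ in range(n_substeps):
        x = x + h * pendulum_derivatives(x, u)
    return x
```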
Control Tasks

We consider two different tasks for our controller. Both are episodic tasks, meaning that each episode eventually comes to an end (either because a time limit has passed or because some event has occurred). In this section, we describe the setup of the tasks as MDPs and the RL methods we used.

Task 1. At the beginning of an episode the pendulum is initialized in the downward position with zero angular velocity. The state information given to the learner is a discretized value of the angle (63 values in total) and the sign of the angular velocity; note that the magnitude of the angular velocity is unknown (hidden state). The controller has five actions available, corresponding to two negative, zero, and two positive force values (in Newtons). These forces are chosen small enough that the only way to move the pendulum higher and higher is by swinging it back and forth. The objective is to find a policy that drives the pendulum past the upward position (at any velocity) within a maximum period of 60 seconds. The system is rewarded by −|θ|, where θ is the smallest (discrete) angular distance from the upward position. The reward is thus proportional to the angular distance from the goal and is maximal (zero) at the upward position. Control actions are taken at discrete time steps and the force remains constant between steps.

Task 2. At the beginning of an episode the pendulum is initialized near the upward position (with a small perturbation of the angle) and with zero angular velocity. The state information given to the learner is a discretized value of the angle (63 values in total) and a discretization of the angular velocity in the range [−1, 1] (21 values in total). The controller again has five actions available, corresponding to two negative, zero, and two positive force values (in Newtons); this time the forces are chosen large enough that they can resist the force due to the weight of the pendulum. The objective is to find a policy that balances the pendulum close to the upward position and close to zero angular velocity for a period of at least 10 seconds. The system receives a reward of 0 for being in the state corresponding to the upward position with zero velocity, or when the period of 10 seconds expires without a failure. If the pendulum falls beyond the downward position or the angular velocity exceeds the limit of 1 in absolute value, the episode ends in failure and the system receives a large negative reward. In all other cases a constant reward of −1 per step is given. Control actions are taken at discrete time steps (every 0.1 seconds, a finer step for better control) and the force remains constant between steps.
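The discrete state encodings used by the two tasks can be sketched as follows. The bin counts (63 angle values, 21 velocity values) and the reward −|θ| come from the task descriptions above; the angle convention (0 at the upward position, angles wrapped to [−π, π)) is an assumption made for illustration, since the paper does not state it explicitly.

```python
import numpy as np

N_ANGLE_BINS, N_VEL_BINS = 63, 21    # from the task descriptions above

def wrap_angle(theta):
    """Wrap an angle to [-pi, pi), taking 0 as the upward position (assumed)."""
    return (theta + np.pi) % (2.0 * np.pi) - np.pi

def discretize(theta, theta_dot, vel_limit=1.0):
    """Discrete (angle_bin, velocity_bin) pair as used in Task 2."""
    a = wrap_angle(theta)
    a_bin = int((a + np.pi) / (2.0 * np.pi) * N_ANGLE_BINS) % N_ANGLE_BINS
    v = float(np.clip(theta_dot, -vel_limit, vel_limit))
    v_bin = int(round((v + vel_limit) / (2.0 * vel_limit) * (N_VEL_BINS - 1)))
    return a_bin, v_bin

def task1_reward(theta):
    """Task 1 reward: minus the angular distance from the upward position,
    largest (zero) when the pendulum points straight up."""
    return -abs(wrap_angle(theta))
```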

Implementation

The system (simulator and controller) was implemented in MATLAB. The learning algorithm is Q(λ), a variant of Q-learning that uses eligibility traces to propagate information to the most recently visited state-action values at once. A constant learning rate and an exploration probability between 0.1 and 0.3 were used, with long eligibility traces and no discounting (γ = 1), since the tasks are episodic. Given that the state space is relatively small, a table-based representation is used to store the Q values.

Results

For each task, we provide three diagrams: the state variables over time, the trajectory in state space, and the applied force over time. Although the learning time was not measured precisely, it was between 4 and 6 minutes.

Task 1. At the very beginning the system behaves in a more-or-less random way, trying to figure out what to do. After some hundreds of trials the goal is achieved, as shown in Figure 3. After several hundred more trials and sufficient exploration the controller improves, and the task is completed faster, as shown in Figure 4.

Task 2. Again, the behavior is rather erratic at the beginning, but after many trials the controller manages to keep the pendulum inverted, close to the upward position, as shown in Figure 5. Figure 5 is a blow-up of the state space, but it is clear from Figure 6 that the pendulum stays inverted. Finally, increasing the episode period to 60 seconds did not affect the controller, which was able to keep control of the pendulum for the entire minute (see Figure 7).

Conclusion and Future Work

This work, and others, demonstrate that there is a lot of potential in learning approaches to control. However, more needs to be done before this technology becomes widely useful. Concerning this particular work, I am mostly interested in finding ways to accelerate learning and make the controller adapt faster; function approximation might be the way to go here. I am also interested in applying RL methods to a real pendulum-cart system and studying the differences (and surprises) between the real world and simulation.

Acknowledgments

I would like to thank Prof. Wang for providing the MATLAB code of the model.

Selected Bibliography

1. Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.

2. Wang, H., Tanaka, K. and Griffin, M. An Approach to Fuzzy Control of Nonlinear Systems: Stability and Design Issues, IEEE Transactions on Fuzzy Systems, 4(1), 1996.

3. Watkins, C. Learning from Delayed Rewards, Ph.D. Thesis, Cambridge University, 1989.

Figure 3: Successful completion of Task 1.

Figure 4: Better completion of Task 1.

Figure 5: Successful completion of Task 2 (blow-up of the state space).

Figure 6: Successful completion of Task 2.

Figure 7: Even better completion of Task 2.
