Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm
Michail G. Lagoudakis
Department of Computer Science, Duke University, Durham, NC 27708
mgl@cs.duke.edu

Abstract

We consider the problem of controlling and balancing a freely-swinging pendulum on a moving cart. We assume that no model of the nonlinear system is available. We model the problem as a Markov Decision Process and draw on techniques from the field of Reinforcement Learning to implement a learning controller. Although slow, the learning process is demonstrated (in simulation) on two challenging control tasks.

Introduction

Nonlinear control has become a growing field of study. Although its techniques and methods still lag far behind the mature field of linear control, there is much interest in developing new nonlinear control methods other than via linearization. This is probably because most phenomena in the real world are inherently nonlinear, and linear methods cannot truly describe them. On the other hand, there is always a need for flexible, adaptive controllers that can operate over a wide range and under uncertain conditions, which linear controllers cannot capture. To this end, a new technology and a whole field, namely that of learning systems, has developed over the last decades. Adaptive control, artificial neural networks, fuzzy logic, machine learning, intelligent control, and learning robotics can all be placed under this umbrella. In this paper, we focus on one such trend that goes by the name of reinforcement learning. We propose two control tasks on a nonlinear system and assume that no model of the system is given. We then develop learning controllers that achieve the tasks by trial and error. Our experience is that, although this process can be time-consuming, it can be very successful.
We begin by providing the necessary background for this work, Markov Decision Processes and Reinforcement Learning, looking more closely at the components used in this paper, such as the Q-learning algorithm. We then present the nonlinear system, a pendulum on a moving cart, as well as the model that simulates its dynamics. Subsequently, we define the two control tasks and provide the details of setting up each as a learning problem. The experimental results demonstrate that our learning system completes both tasks successfully in reasonable time, exhibiting reasonable performance.

Markov Decision Processes

A Markov Decision Process (MDP) consists of:

- a finite set of states S
- a finite set of actions A
- a state transition function T : S × A → Π(S), where Π(S) is a probability distribution over S and T(s, a, s′) is the probability of making the transition from s to s′ under action a
- a reward function R : S × A × S → ℝ, where R(s, a, s′) is the (expected) reward for making the transition from s to s′ under action a.

Many real-world problems (especially in the Operations Research area) can be formulated under this framework. The distinguishing characteristic is the so-called Markov property, namely the property that state transitions are independent of any previous states and/or actions, given the current state and action.

We can use a transition graph to visualize an MDP. Figure 1 shows such a graph for an MDP with 2 states (battery level: LOW and HIGH) and 3 actions (robot: Wait, Search, Recharge). Transition probabilities and rewards are shown on the edges. The task here refers to a recycling robot that can choose among searching for an empty can, waiting, or recharging the battery, depending on its battery level.

Figure 1: The Recycling Robot (Sutton and Barto, 1998)
A deterministic policy for an MDP is a mapping π : S → A. A stochastic policy is a mapping π : S → Π(A), where Π(A) is a probability distribution over A. Stochastic policies are also called soft, for they do not commit to a single action per state. We use π(s, a) for the probability that policy π chooses action a in state s. An example would be the (1−ε)-soft policy, for some 0 < ε < 1, which picks a particular action with probability 1−ε and picks randomly with probability ε.

Finally, an optimal policy π* for an MDP is a policy that maximizes the expected total reward over time (there is a similar definition for minimization problems). Formally:

π* = argmax_π E_π(R) = argmax_π E_π( Σ_{t=0}^{h} γ^t R(t) )

where h determines the horizon (how long the process runs) and γ is the discount rate (which relates to the present value of future rewards). For episodic tasks the horizon is finite, h < ∞, and γ ≤ 1. For continuing tasks, h = ∞ and 0 < γ < 1.

Value Functions

Given an MDP, the state value function assigns a value to each state. The value V^π(s) of a state s under a policy π is the expected return when starting in state s and following π thereafter. Formally,

V^π(s) = E_π(R_t | s_t = s)

Similarly, the state-action value function assigns a value to each (state, action) pair. The value Q^π(s, a) of taking action a in state s under a policy π is the expected return starting from s, taking a, and following π thereafter:

Q^π(s, a) = E_π(R_t | s_t = s, a_t = a)

Notice that the state and state-action value functions are related as follows:

V^π(s) = Σ_{a∈A} π(s, a) Q^π(s, a)
Q^π(s, a) = Σ_{s′∈S} T(s, a, s′) [ R(s, a, s′) + γ V^π(s′) ]

By substituting the expressions above into each other we obtain the following recursive equations for the value functions, known as the Bellman equations:

V^π(s) = Σ_{a∈A} π(s, a) Σ_{s′∈S} T(s, a, s′) [ R(s, a, s′) + γ V^π(s′) ]
Q^π(s, a) = Σ_{s′∈S} T(s, a, s′) [ R(s, a, s′) + γ Σ_{a′∈A} π(s′, a′) Q^π(s′, a′) ]

It is easy to see that the equations above are linear in V^π(s) (respectively, Q^π(s, a)), and the exact values can be obtained by solving the resulting system of linear equations.
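Because the Bellman equations are linear for a fixed policy, exact policy evaluation reduces to solving the linear system (I − γ P^π) V = r^π. The following is a self-contained sketch on a made-up two-state chain, using a small Gaussian elimination rather than a linear-algebra library.

```python
# Exact policy evaluation by direct linear solve. P is the transition matrix
# under the fixed policy; r holds the expected one-step rewards.

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def evaluate_policy(P, r, gamma):
    """V = (I - gamma P)^(-1) r for a fixed policy."""
    n = len(r)
    A = [[(1.0 if i == j else 0.0) - gamma * P[i][j] for j in range(n)]
         for i in range(n)]
    return solve_linear(A, r)

# Toy chain: state 0 moves to state 1 (reward 1); state 1 stays (reward 0).
P = [[0.0, 1.0], [0.0, 1.0]]
r = [1.0, 0.0]
V = evaluate_policy(P, r, gamma=0.5)  # V = [1.0, 0.0]
```

For |S| states this direct solve costs O(|S|^3), which is the cost quoted later for the evaluation step of policy iteration.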
The optimal value function is the one that corresponds to the optimal policy. It is true, in this case, that the value of each state or state-action pair is maximized:

V*(s) = max_π V^π(s)
Q*(s, a) = max_π Q^π(s, a)

It has been proven that for any MDP there exists a deterministic optimal policy. Given that, we can derive an expression for the optimal value function from the equations above, by always selecting (with probability 1) the action that maximizes the value. This results in the Bellman optimality equations:

V*(s) = max_{a∈A} Σ_{s′∈S} T(s, a, s′) [ R(s, a, s′) + γ V*(s′) ]
Q*(s, a) = Σ_{s′∈S} T(s, a, s′) [ R(s, a, s′) + γ max_{a′∈A} Q*(s′, a′) ]

Unfortunately, these are nonlinear equations and cannot be solved analytically. There is, however, a dynamic programming algorithm, known as value iteration, that iteratively approximates the optimal value function.

Why is the optimal value function so important? Simply because, given the optimal value function, we can easily derive the optimal policy for the MDP. It is known that the optimal policy is deterministic and greedy with respect to the optimal value function. That means that it picks only one action per state, namely the one that locally maximizes the value function. Formally, in state s, pick action a* such that:

a* = argmax_{a∈A} Σ_{s′∈S} T(s, a, s′) [ R(s, a, s′) + γ V*(s′) ]
a* = argmax_{a∈A} Q*(s, a)

The beauty of the value function is that it turns a difficult global optimization problem over the long term (finding the optimal policy) into a simple local search problem. Notice that given V*(s) we still need the environment dynamics (T and R) in order to determine the optimal policy, whereas given Q*(s, a) we need no additional information. This is a crucial feature that will be exploited later for model-free learning. However, it comes at the expense of a more complicated (2-dimensional) function.

Solving MDPs

There are a number of methods to solve an MDP, i.e.
to determine the optimal policy. We briefly mention four of them.

Value Iteration In this case, we try to find (or rather, approximate) the optimal value function by solving the Bellman optimality equations; the optimal policy can then easily be found. The idea is to start with an arbitrary value function and iteratively update the values using the Bellman optimality equations until there is no change. Eventually, the value function converges to the optimal one. The cost is O(|A||S|^2) per iteration, but the number of iterations can grow exponentially in the discount factor.

Policy Iteration In this case, we manipulate the policy directly. Start with an arbitrary policy, evaluate it (solve the linear Bellman equations), improve it (make it greedy), and repeat until there is no change in the policy (it converges to the optimal one). The cost in this case is O(|A||S|^2 + |S|^3) per iteration, but the number of iterations is bounded by |A|^|S|,
the total number of all possible distinct policies. Fortunately, there is no evidence that it ever gets that big in practice.

Modified Policy Iteration This method combines the previous two. Instead of solving the Bellman equations exactly to evaluate an intermediate policy (an O(|S|^3) operation), one can run a few steps of value iteration (O(|A||S|^2) each) to merely approximate it, then improve the policy and repeat. It turns out that the method converges to the optimal policy and is very efficient in practice.

Linear Programming Finally, MDPs can be recast as Linear Programming problems. This is the only known polynomial-time algorithm, although it is not always efficient in practice.

Reinforcement Learning

So far we have assumed that a model is given and the dynamics are known. Uncertainty, however, is inherent in the real world. Sensors are limited and/or inaccurate, measurements cannot be taken at all times, many systems are unpredictable, and so on. It is the case that, for many interesting problems, the transition probabilities T(s, a, s′) or the reward function R(s, a, s′) are unavailable. Can we still make good decisions in such a case? On what basis?

Let's begin with what is available. We assume that we have complete observability of the state, that is, we know which state we are in with certainty. (The cases where there is hidden state information are formulated as Partially Observable Markov Decision Processes (POMDPs). We do not discuss POMDPs here, but we consider a case with hidden state later.) Besides, we can sample the unknown functions T(s, a, s′) and R(s, a, s′) through interaction with the environment. Based on that experience, i.e., reinforcements from the environment, can we learn to act optimally in the world? That leads us to the field of Reinforcement Learning (RL), which has recently received significant attention. It is better to think of RL as a class of problems, rather than as a class of methods. These are exactly the problems where a complete, interactive, goal-seeking agent learns to act in its environment by trial and error. There are several issues associated with RL problems.
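Before turning to those issues, the dynamic-programming solution methods above can be made concrete; here is a minimal Python sketch of value iteration followed by greedy policy extraction, on a made-up two-state MDP.

```python
# Value iteration: repeatedly apply the Bellman optimality backup
#   V(s) <- max_a sum_s' T(s,a,s') [R(s,a,s') + gamma V(s')]
# until the values stop changing, then read off the greedy policy.

def value_iteration(S, A, T, R, gamma=0.9, tol=1e-8):
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            backup = max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
                for a in A
            )
            delta = max(delta, abs(backup - V[s]))
            V[s] = backup
        if delta < tol:
            return V

def greedy_policy(S, A, T, R, V, gamma):
    """With the model (T, R) and V in hand, act greedily one step ahead."""
    return {s: max(A, key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                        for s2, p in T[(s, a)]))
            for s in S}

# Toy MDP: staying in "b" pays 2 per step, so "a" should move to "b".
S = ["a", "b"]
A = ["stay", "go"]
T = {("a", "stay"): [("a", 1.0)], ("a", "go"): [("b", 1.0)],
     ("b", "stay"): [("b", 1.0)], ("b", "go"): [("a", 1.0)]}
R = {("a", "stay", "a"): 0.0, ("a", "go", "b"): 1.0,
     ("b", "stay", "b"): 2.0, ("b", "go", "a"): 0.0}
V = value_iteration(S, A, T, R, gamma=0.5)
pi = greedy_policy(S, A, T, R, V, gamma=0.5)  # {"a": "go", "b": "stay"}
```

For this example the fixed point can be checked by hand: V(b) = 2 + 0.5 V(b) gives V(b) = 4, and V(a) = 1 + 0.5 V(b) = 3.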
We focus on three of them.

Delayed Rewards In many problems, reward is received only after long periods. Consider chess or any similar game: the agent cannot be rewarded before the end of the game, since only then is the outcome known.

Credit Assignment This problem has two aspects. Assuming that the agent receives a big reward (or punishment), which one of the most recent (sequential) decisions was responsible for it (temporal credit assignment)? Also, which part of the system was responsible for it (structural credit assignment)?

Exploration and Exploitation Since the agent is not given information in advance, it has to learn by exploration. Exploration results in some knowledge that can be exploited to gain more and more reward. But there might be better, undiscovered ways to get rewarded. So, exploration and exploitation pull in different directions. How can the agent balance the trade-off between them?

Given the situation that an RL agent faces, there are two general ways to go:

Model-Based Approaches Learn a model of the process and use the model to derive an optimal policy. This corresponds to so-called indirect control.

Model-Free Approaches Learn a policy without learning a model. This corresponds to so-called direct control.

For this work we focus on model-free methods for RL problems. Such methods mostly proceed by trying to estimate (learn) the optimal state-action value function of the MDP. If this is possible, we can figure out the optimal policy without a model. Two broad classes of such methods are the following:

Monte-Carlo (MC) Methods Monte-Carlo methods estimate the values by sampling total rewards over long runs starting from particular state-action pairs. Given enough data, this leads to a very good approximation.
By using exploring starts or a stochastic policy, the agent maintains exploration during learning.

Temporal Difference (TD) Methods In this case, the agent estimates the values based on sample rewards and previous estimates. Thus, TD methods are fully incremental and learn on a step-by-step basis. Moreover, the algorithms known as TD(λ) provide a way to add some MC flavor to a TD algorithm by propagating recent information through the most recent path in the state space, using the so-called eligibility traces.

In both cases, it is possible to learn the value function of one policy while following another. For example, we can learn the optimal value function while following an exploratory policy. This is called off-policy learning. When the two policies match, we talk about on-policy learning.

Q-Learning: An Off-Policy TD Method

Q-learning (Watkins, 1989) is one of the best-known and most important RL algorithms. Q-learning attempts to learn the Q values of the state-action value function, and thus the name. The idea is as follows. You are currently in state s_t. Take action a_t under the current policy. Observe the new state s_{t+1} and the immediate reward r_{t+1}. The previous estimate for (s_t, a_t) was Q(s_t, a_t). The experienced estimate for (s_t, a_t) is

r_{t+1} + γ max_a Q(s_{t+1}, a)

The new estimate for (s_t, a_t) is a convex combination of the two:

Q^{(t+1)}(s_t, a_t) = (1 − α) Q^{(t)}(s_t, a_t) + α [ r_{t+1} + γ max_a Q^{(t)}(s_{t+1}, a) ]
where α (0 < α < 1) is the learning rate.

Generalization and Function Approximation

Finally, it should be mentioned that there are different ways to represent the value functions, and the success or failure of a value-function-based RL method can depend on this choice. The naive way is to use a (huge) table to store the values Q(s, a). The upside is that a table can represent any function and convergence is guaranteed. The downside, on the other hand, is slow learning. Besides, the curse of dimensionality (exponential growth of the state space) and the restriction to discrete spaces can be severe handicaps of the tabular representation.

Another way is to use a function approximator to represent Q(s, a). The intuition is that similar entries (s, a) should have similar values Q(s, a). Depending on the definition of similarity and the generalization capabilities of the function approximator used, this representation can be a big win. Choices for a function approximator include, but are not limited to, state aggregation, polynomial regression, neural networks, etc. The upside in this case is rapid adaptation, the handling of continuous spaces, and, of course, the generalization ability. The main drawback is that convergence is no longer guaranteed.

The Pendulum on the Cart

Figure 2 describes the system we study in this paper. A moving cart carries a pendulum that can swing freely around its pivot. Assume that the system is built in such a way that the pendulum can complete the circle.

Figure 2: The Pendulum on the Cart

This system is described by the following state equations:

ẋ₁ = x₂
ẋ₂ = [ g sin(x₁) − a m l x₂² sin(2x₁)/2 − a cos(x₁) u ] / [ 4l/3 − a m l cos²(x₁) ]

where the variables and parameters are as follows:

x₁ : angle; x₂ : angular velocity
g : gravity constant (g = 9.8 m/s²)
m : mass of the pendulum (m = 2.0 kg)
M : mass of the cart (M = 8.0 kg)
l : length of the pendulum
u : force applied to the cart (in Newtons)
a : a = 1/(m + M)

It is easy to see that this is a nonlinear system.
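The state equations above can be simulated directly. A minimal Python sketch using simple Euler integration follows; the pendulum length l = 0.5 m and the step size are assumptions made here for illustration, not values given in the text.

```python
import math

# Parameters from the text; l = 0.5 and dt = 0.01 are assumed values.
g, m, M, l = 9.8, 2.0, 8.0, 0.5
a = 1.0 / (m + M)

def dynamics(x1, x2, u):
    """Return (dx1/dt, dx2/dt) for angle x1, angular velocity x2, force u."""
    num = (g * math.sin(x1)
           - a * m * l * x2**2 * math.sin(2 * x1) / 2
           - a * math.cos(x1) * u)
    den = 4 * l / 3 - a * m * l * math.cos(x1)**2
    return x2, num / den

def step(x1, x2, u, dt=0.01):
    """One Euler integration step of the state equations."""
    d1, d2 = dynamics(x1, x2, u)
    return x1 + dt * d1, x2 + dt * d2

# Starting near the upright position (x1 = 0) with no force, the pendulum
# falls away from upright, confirming that upright is an unstable equilibrium.
x1, x2 = 0.1, 0.0
for _ in range(100):  # one second of simulated time
    x1, x2 = step(x1, x2, u=0.0)
assert x1 > 0.1 and x2 > 0.0
```

A fixed-step Euler scheme is the crudest choice; a Runge-Kutta integrator would track the true trajectory more closely at the same step size, but the qualitative behavior is the same.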
Controlling such a system to achieve a certain behavior or reach a certain state is not a trivial task. For our purposes, we ignore the model of the system. If the model were known, there would be no need to apply learning; there exist several powerful techniques (see, for example, (Wang et al., 1996)) that can solve the problem optimally. So, for this work, the model serves only as the simulator of the system dynamics, given the lack of a real cart with a pendulum.

Control Tasks

We consider two different tasks for our controller. Both are episodic tasks, meaning that they eventually come to an end (either because some time limit has passed or because some event has happened). In this section, we describe the setup of the tasks as MDPs and the RL methods we used.

Task 1 At the beginning of the episode the pendulum is initialized in the downward position (angle = π) with zero angular velocity. The state information given to the system is a discretized value of the angle (63 values in total) and the sign of the angular velocity. Note that the magnitude of the angular velocity is unknown (hidden state). The controller has five available actions, corresponding to forces of two magnitudes in either direction plus a zero force. These forces are chosen small enough that the only way to move the pendulum higher and higher is by swinging back and forth. The objective is to find a policy that will drive the pendulum past the upward position (at any velocity) within a maximum period of 60 seconds. The system is rewarded by −|θ|, where θ is the smallest (discretized) angle from the upward position. Obviously, the reward is proportional to the (angular) distance from the goal, and it is maximal (zero) at the upward position. Control actions are taken at discrete time steps and the force remains constant between steps.

Task 2 At the beginning of the episode the pendulum is initialized in the upward position with some perturbation (angle = 0 + ε) and with zero angular velocity.
The state information given to the system is a discretized value of the angle (63 values in total) and a discretization of the angular velocity in the range [−1, 1] (21 values in total). The controller has five available actions, corresponding to forces of two magnitudes in either direction plus a zero force. These forces are chosen big enough that they can resist the force due to the weight of the pendulum. The objective is to find a policy that will balance the pendulum close to the upward position and close to zero angular velocity for a period of at least 10 seconds. The system is rewarded with a reward of zero for being in the state corresponding to the upward position
and zero velocity, or when the period of 10 seconds expires without a failure. If the pendulum falls beyond the downward position, or the angular velocity exceeds the limit of 1 in absolute value, the system receives a large negative (failure) reward. In all other cases a constant reward of −1 is given. Control actions are taken at discrete time steps (every 0.1 seconds, for finer control) and the force remains constant between steps.

Implementation

The system (simulator and controller) was implemented in MATLAB. The learning algorithm is Q(λ), a variant of Q-learning that uses eligibility traces to propagate information to all of the most recently visited state-action values at once. A learning rate of about 0.4 was used, with an exploration probability between 0.1 and 0.3. Long traces (λ roughly between 0.7 and 0.9) were used, and no discounting (γ = 1), since the tasks are episodic. Given that the state space is relatively small, a table-based representation is used to store the Q values.

Acknowledgments

I would like to thank Prof. Wang for providing the MATLAB code of the model.

Selected Bibliography

1. Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
2. Wang, H., Tanaka, K., and Griffin, M. An Approach to Fuzzy Control of Nonlinear Systems: Stability and Design Issues, IEEE Transactions on Fuzzy Systems, 4(1), 1996.
3. Watkins, C. Learning from Delayed Rewards, Ph.D. Thesis, Cambridge University, 1989.

Results

For each case, we provide three diagrams: the state variables over time, the trajectory in state space, and the control force over time. Although learning time was not measured precisely, it was between 4 and 6 minutes.

Task 1 At the very beginning, the system behaves in a more-or-less random way, trying to figure out what to do. After some hundreds of trials the goal is achieved, as shown in Figure 3.
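A single backup of a Q(λ)-style algorithm can be sketched as follows. This is an illustrative reconstruction (a Watkins-style variant with replacing traces), not the paper's MATLAB code, and the hyperparameter values are placeholders.

```python
# One Q(lambda) backup: the usual Q-learning TD error updates every recently
# visited state-action pair through eligibility traces e, which decay by
# gamma * lam and are cut to zero after a non-greedy (exploratory) action.

def q_lambda_step(Q, e, s, a, r, s2, a2, actions,
                  alpha=0.4, gamma=1.0, lam=0.8):
    greedy = max(actions, key=lambda b: Q[(s2, b)])
    delta = r + gamma * Q[(s2, greedy)] - Q[(s, a)]   # TD error
    e[(s, a)] = 1.0                                   # replacing trace
    for sa in e:
        Q[sa] += alpha * delta * e[sa]
        # decay traces, or cut them if the next action taken is exploratory
        e[sa] *= gamma * lam if a2 == greedy else 0.0

# One backup on an initially zero table (made-up transition and reward):
actions = [0, 1]
Q = {(s, b): 0.0 for s in [0, 1] for b in actions}
e = {(s, b): 0.0 for s in [0, 1] for b in actions}
q_lambda_step(Q, e, s=0, a=1, r=1.0, s2=1, a2=0, actions=actions)
# Q[(0, 1)] is now 0.4 (= alpha * delta, with delta = 1)
```

With lam = 0, the trace update leaves only the current pair active and the rule reduces to the one-step Q-learning backup given earlier.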
After some more hundreds of trials and sufficient exploration, the controller improves and the task is completed faster, as shown in Figure 4.

Task 2 Again, at the beginning the behavior is rather erratic, but after lots of trials the controller manages to keep the pendulum inverted, close to the upward position, as shown in Figure 5. Figure 5 is a blow-up of the state space, but it is clear from Figure 6 that the pendulum stays inverted. Finally, increasing the episode period to 60 seconds did not affect the controller, which was able to keep control of the pendulum for the entire minute (see Figure 7).

Conclusion and Future Work

This work, among others, demonstrates that there is a lot of potential in learning approaches to control. However, there is more to be done before this technology becomes widely useful. Concerning this particular work, I am mostly interested in figuring out ways to accelerate learning and make the controller adapt faster; function approximation might be the choice here. On the other hand, I am interested in applying RL methods to a real pendulum-cart system and studying the differences (and surprises) between the real world and simulation.
Figure 3: Successful completion of task 1.
Figure 4: Better completion of task 1.
Figure 5: Successful completion of task 2.
Figure 6: Successful completion of task 2.
Figure 7: Even better completion of task 2.
REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning Ronen Tamari The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (#67679) February 28, 2016 Ronen Tamari
More informationReinforcement Learning
Reinforcement Learning Function approximation Mario Martin CS-UPC May 18, 2018 Mario Martin (CS-UPC) Reinforcement Learning May 18, 2018 / 65 Recap Algorithms: MonteCarlo methods for Policy Evaluation
More informationReview: TD-Learning. TD (SARSA) Learning for Q-values. Bellman Equations for Q-values. P (s, a, s )[R(s, a, s )+ Q (s, (s ))]
Review: TD-Learning function TD-Learning(mdp) returns a policy Class #: Reinforcement Learning, II 8s S, U(s) =0 set start-state s s 0 choose action a, using -greedy policy based on U(s) U(s) U(s)+ [r
More informationReinforcement Learning. George Konidaris
Reinforcement Learning George Konidaris gdk@cs.brown.edu Fall 2017 Machine Learning Subfield of AI concerned with learning from data. Broadly, using: Experience To Improve Performance On Some Task (Tom
More information, and rewards and transition matrices as shown below:
CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount
More informationMachine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?
Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity
More informationTemporal Difference Learning & Policy Iteration
Temporal Difference Learning & Policy Iteration Advanced Topics in Reinforcement Learning Seminar WS 15/16 ±0 ±0 +1 by Tobias Joppen 03.11.2015 Fachbereich Informatik Knowledge Engineering Group Prof.
More informationReinforcement Learning
Reinforcement Learning 1 Reinforcement Learning Mainly based on Reinforcement Learning An Introduction by Richard Sutton and Andrew Barto Slides are mainly based on the course material provided by the
More informationMDP Preliminaries. Nan Jiang. February 10, 2019
MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process
More informationTemporal difference learning
Temporal difference learning AI & Agents for IET Lecturer: S Luz http://www.scss.tcd.ie/~luzs/t/cs7032/ February 4, 2014 Recall background & assumptions Environment is a finite MDP (i.e. A and S are finite).
More informationReinforcement Learning: An Introduction
Introduction Betreuer: Freek Stulp Hauptseminar Intelligente Autonome Systeme (WiSe 04/05) Forschungs- und Lehreinheit Informatik IX Technische Universität München November 24, 2004 Introduction What is
More informationCMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro
CMU 15-781 Lecture 12: Reinforcement Learning Teacher: Gianni A. Di Caro REINFORCEMENT LEARNING Transition Model? State Action Reward model? Agent Goal: Maximize expected sum of future rewards 2 MDP PLANNING
More informationMarks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam:
Marks } Assignment 1: Should be out this weekend } All are marked, I m trying to tally them and perhaps add bonus points } Mid-term: Before the last lecture } Mid-term deferred exam: } This Saturday, 9am-10.30am,
More information16.410/413 Principles of Autonomy and Decision Making
16.410/413 Principles of Autonomy and Decision Making Lecture 23: Markov Decision Processes Policy Iteration Emilio Frazzoli Aeronautics and Astronautics Massachusetts Institute of Technology December
More informationSequential decision making under uncertainty. Department of Computer Science, Czech Technical University in Prague
Sequential decision making under uncertainty Jiří Kléma Department of Computer Science, Czech Technical University in Prague https://cw.fel.cvut.cz/wiki/courses/b4b36zui/prednasky pagenda Previous lecture:
More informationReinforcement Learning
Reinforcement Learning Markov decision process & Dynamic programming Evaluative feedback, value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value
More informationCSC321 Lecture 22: Q-Learning
CSC321 Lecture 22: Q-Learning Roger Grosse Roger Grosse CSC321 Lecture 22: Q-Learning 1 / 21 Overview Second of 3 lectures on reinforcement learning Last time: policy gradient (e.g. REINFORCE) Optimize
More informationReinforcement learning
Reinforcement learning Based on [Kaelbling et al., 1996, Bertsekas, 2000] Bert Kappen Reinforcement learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error
More informationCourse 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016
Course 16:198:520: Introduction To Artificial Intelligence Lecture 13 Decision Making Abdeslam Boularias Wednesday, December 7, 2016 1 / 45 Overview We consider probabilistic temporal models where the
More informationActive Policy Iteration: Efficient Exploration through Active Learning for Value Function Approximation in Reinforcement Learning
Active Policy Iteration: fficient xploration through Active Learning for Value Function Approximation in Reinforcement Learning Takayuki Akiyama, Hirotaka Hachiya, and Masashi Sugiyama Department of Computer
More informationLearning in Zero-Sum Team Markov Games using Factored Value Functions
Learning in Zero-Sum Team Markov Games using Factored Value Functions Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 27708 mgl@cs.duke.edu Ronald Parr Department of Computer
More informationToday s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning
CSE 473: Artificial Intelligence Reinforcement Learning Dan Weld Today s Outline Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Many slides
More informationThe Book: Where we are and where we re going. CSE 190: Reinforcement Learning: An Introduction. Chapter 7: Eligibility Traces. Simple Monte Carlo
CSE 190: Reinforcement Learning: An Introduction Chapter 7: Eligibility races Acknowledgment: A good number of these slides are cribbed from Rich Sutton he Book: Where we are and where we re going Part
More informationReinforcement Learning, Neural Networks and PI Control Applied to a Heating Coil
Reinforcement Learning, Neural Networks and PI Control Applied to a Heating Coil Charles W. Anderson 1, Douglas C. Hittle 2, Alon D. Katz 2, and R. Matt Kretchmar 1 1 Department of Computer Science Colorado
More informationMonte Carlo is important in practice. CSE 190: Reinforcement Learning: An Introduction. Chapter 6: Temporal Difference Learning.
Monte Carlo is important in practice CSE 190: Reinforcement Learning: An Introduction Chapter 6: emporal Difference Learning When there are just a few possibilitieo value, out of a large state space, Monte
More informationINF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018
Machine learning for image classification Lecture 14: Reinforcement learning May 9, 2018 Page 3 Outline Motivation Introduction to reinforcement learning (RL) Value function based methods (Q-learning)
More informationGeneralization and Function Approximation
Generalization and Function Approximation 0 Generalization and Function Approximation Suggested reading: Chapter 8 in R. S. Sutton, A. G. Barto: Reinforcement Learning: An Introduction MIT Press, 1998.
More informationAutonomous Helicopter Flight via Reinforcement Learning
Autonomous Helicopter Flight via Reinforcement Learning Authors: Andrew Y. Ng, H. Jin Kim, Michael I. Jordan, Shankar Sastry Presenters: Shiv Ballianda, Jerrolyn Hebert, Shuiwang Ji, Kenley Malveaux, Huy
More informationTemporal Difference. Learning KENNETH TRAN. Principal Research Engineer, MSR AI
Temporal Difference Learning KENNETH TRAN Principal Research Engineer, MSR AI Temporal Difference Learning Policy Evaluation Intro to model-free learning Monte Carlo Learning Temporal Difference Learning
More informationOn and Off-Policy Relational Reinforcement Learning
On and Off-Policy Relational Reinforcement Learning Christophe Rodrigues, Pierre Gérard, and Céline Rouveirol LIPN, UMR CNRS 73, Institut Galilée - Université Paris-Nord first.last@lipn.univ-paris13.fr
More informationA Gentle Introduction to Reinforcement Learning
A Gentle Introduction to Reinforcement Learning Alexander Jung 2018 1 Introduction and Motivation Consider the cleaning robot Rumba which has to clean the office room B329. In order to keep things simple,
More informationNotes on Reinforcement Learning
1 Introduction Notes on Reinforcement Learning Paulo Eduardo Rauber 2014 Reinforcement learning is the study of agents that act in an environment with the goal of maximizing cumulative reward signals.
More informationMS&E338 Reinforcement Learning Lecture 1 - April 2, Introduction
MS&E338 Reinforcement Learning Lecture 1 - April 2, 2018 Introduction Lecturer: Ben Van Roy Scribe: Gabriel Maher 1 Reinforcement Learning Introduction In reinforcement learning (RL) we consider an agent
More informationReinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina
Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it
More informationMachine Learning I Continuous Reinforcement Learning
Machine Learning I Continuous Reinforcement Learning Thomas Rückstieß Technische Universität München January 7/8, 2010 RL Problem Statement (reminder) state s t+1 ENVIRONMENT reward r t+1 new step r t
More informationReading Response: Due Wednesday. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1
Reading Response: Due Wednesday R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1 Another Example Get to the top of the hill as quickly as possible. reward = 1 for each step where
More informationProcedia Computer Science 00 (2011) 000 6
Procedia Computer Science (211) 6 Procedia Computer Science Complex Adaptive Systems, Volume 1 Cihan H. Dagli, Editor in Chief Conference Organized by Missouri University of Science and Technology 211-
More informationAdaptive Importance Sampling for Value Function Approximation in Off-policy Reinforcement Learning
Adaptive Importance Sampling for Value Function Approximation in Off-policy Reinforcement Learning Abstract Off-policy reinforcement learning is aimed at efficiently using data samples gathered from a
More informationConnecting the Demons
Connecting the Demons How connection choices of a Horde implementation affect Demon prediction capabilities. Author: Jasper van der Waa jaspervanderwaa@gmail.com Supervisors: Dr. ir. Martijn van Otterlo
More informationGrundlagen der Künstlichen Intelligenz
Grundlagen der Künstlichen Intelligenz Reinforcement learning Daniel Hennes 4.12.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Reinforcement learning Model based and
More informationReinforcement Learning In Continuous Time and Space
Reinforcement Learning In Continuous Time and Space presentation of paper by Kenji Doya Leszek Rybicki lrybicki@mat.umk.pl 18.07.2008 Leszek Rybicki lrybicki@mat.umk.pl Reinforcement Learning In Continuous
More informationReinforcement Learning as Classification Leveraging Modern Classifiers
Reinforcement Learning as Classification Leveraging Modern Classifiers Michail G. Lagoudakis and Ronald Parr Department of Computer Science Duke University Durham, NC 27708 Machine Learning Reductions
More informationReinforcement Learning
1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision
More informationThe Reinforcement Learning Problem
The Reinforcement Learning Problem Slides based on the book Reinforcement Learning by Sutton and Barto Formalizing Reinforcement Learning Formally, the agent and environment interact at each of a sequence
More informationReinforcement Learning Part 2
Reinforcement Learning Part 2 Dipendra Misra Cornell University dkm@cs.cornell.edu https://dipendramisra.wordpress.com/ From previous tutorial Reinforcement Learning Exploration No supervision Agent-Reward-Environment
More informationSequential Decision Problems
Sequential Decision Problems Michael A. Goodrich November 10, 2006 If I make changes to these notes after they are posted and if these changes are important (beyond cosmetic), the changes will highlighted
More informationAdministration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon.
Administration CSCI567 Machine Learning Fall 2018 Prof. Haipeng Luo U of Southern California Nov 7, 2018 HW5 is available, due on 11/18. Practice final will also be available soon. Remaining weeks: 11/14,
More informationReinforcement learning
Reinforcement learning Stuart Russell, UC Berkeley Stuart Russell, UC Berkeley 1 Outline Sequential decision making Dynamic programming algorithms Reinforcement learning algorithms temporal difference
More informationReinforcement Learning and Control
CS9 Lecture notes Andrew Ng Part XIII Reinforcement Learning and Control We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make
More informationIntroduction to Reinforcement Learning
Introduction to Reinforcement Learning Rémi Munos SequeL project: Sequential Learning http://researchers.lille.inria.fr/ munos/ INRIA Lille - Nord Europe Machine Learning Summer School, September 2011,
More informationCS 7180: Behavioral Modeling and Decisionmaking
CS 7180: Behavioral Modeling and Decisionmaking in AI Markov Decision Processes for Complex Decisionmaking Prof. Amy Sliva October 17, 2012 Decisions are nondeterministic In many situations, behavior and
More informationThe Markov Decision Process Extraction Network
The Markov Decision Process Extraction Network Siegmund Duell 1,2, Alexander Hans 1,3, and Steffen Udluft 1 1- Siemens AG, Corporate Research and Technologies, Learning Systems, Otto-Hahn-Ring 6, D-81739
More informationCOMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning. Hanna Kurniawati
COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning Hanna Kurniawati Today } What is machine learning? } Where is it used? } Types of machine learning
More informationReinforcement Learning
CS7/CS7 Fall 005 Supervised Learning: Training examples: (x,y) Direct feedback y for each input x Sequence of decisions with eventual feedback No teacher that critiques individual actions Learn to act
More informationReinforcement Learning and NLP
1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value
More informationilstd: Eligibility Traces and Convergence Analysis
ilstd: Eligibility Traces and Convergence Analysis Alborz Geramifard Michael Bowling Martin Zinkevich Richard S. Sutton Department of Computing Science University of Alberta Edmonton, Alberta {alborz,bowling,maz,sutton}@cs.ualberta.ca
More informationReinforcement Learning
Reinforcement Learning Dipendra Misra Cornell University dkm@cs.cornell.edu https://dipendramisra.wordpress.com/ Task Grasp the green cup. Output: Sequence of controller actions Setup from Lenz et. al.
More information