Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm

Michail G. Lagoudakis
Department of Computer Science, Duke University, Durham, NC 27708
mgl@cs.duke.edu

Abstract

We consider the problem of controlling and balancing a freely-swinging pendulum on a moving cart. We assume that no model of the nonlinear system is available. We model the problem as a Markov Decision Process and draw techniques from the field of Reinforcement Learning to implement a learning controller. Although slow, the learning process is demonstrated (in simulation) on two challenging control tasks.

Introduction

Nonlinear control has become a growing field of study. Although the available techniques and methods lag well behind the mature field of linear control, there is much interest in developing nonlinear control methods that go beyond linearization. This is probably because most phenomena in the real world are inherently nonlinear, and linear methods cannot adequately describe them. At the same time, there is a persistent need for flexible and adaptive controllers that can operate over a wide range of conditions and under uncertainty, which linear controllers cannot capture. To this end, a new technology and a whole field, namely that of learning systems, has developed over the last decades. Adaptive control, artificial neural networks, fuzzy logic, machine learning, intelligent control, and learning robotics can all be placed under this umbrella. In this paper, we focus on one such trend that goes by the name of reinforcement learning.

We propose two control tasks on a nonlinear system and assume that no model of the system is given. We then develop learning controllers that achieve each task by trial and error. Our experience is that although this process can be time-consuming, it can be very successful. We begin by providing the necessary background for this work, Markov Decision Processes and Reinforcement Learning, looking more closely at the components used in this paper, such as the Q-Learning algorithm. We then present the nonlinear system, which consists of a pendulum on a moving cart, as well as the model that simulates its dynamics. Subsequently, we define the two control tasks and provide the details of setting them up as learning problems. The experimental results demonstrate that our learning system completes both tasks successfully in reasonable time and with reasonable performance.

Markov Decision Processes

A Markov Decision Process (MDP) consists of:
- a finite set of states $S$
- a finite set of actions $A$
- a state transition function $T : S \times A \to \Pi(S)$, where $\Pi(S)$ is a probability distribution over $S$ and $T(s, a, s')$ is the probability of making the transition $s \xrightarrow{a} s'$
- a reward function $R : S \times A \times S \to \mathbb{R}$, where $R(s, a, s')$ is the (expected) reward for making the transition $s \xrightarrow{a} s'$

Many real-world problems (especially in the Operations Research area) can be formulated under this framework. The distinguishing characteristic is the so-called Markov property, namely the property that state transitions are independent of any previous states and/or actions. We can use a transition graph to visualize an MDP. Figure 1 shows such a graph for an MDP with 2 states (battery level: LOW and HIGH) and 3 actions (robot: Wait, Search, Recharge). Transition probabilities and rewards are shown on the edges.
The task here refers to a recycling robot that can choose among searching for an empty can, waiting, or recharging its battery, depending on its battery level. The edges of the transition graph are labeled with transition probabilities (expressed through parameters $\alpha$ and $\beta$) and rewards ($R_{\text{search}}$, $R_{\text{wait}}$).

Figure 1: The Recycling Robot (Sutton and Barto, 1998)
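
To make the ingredients of an MDP concrete, the sketch below encodes a small recycling-robot-style MDP as plain tables, following Sutton and Barto's version of this example. The numeric values chosen here for $\alpha$, $\beta$, $R_{\text{search}}$, and $R_{\text{wait}}$ are illustrative assumptions, not values taken from Figure 1.

```python
# A small MDP encoded as explicit tables (recycling-robot style, after Sutton & Barto, 1998).
# The numeric values of alpha, beta, R_search, and R_wait are illustrative assumptions.
HIGH, LOW = 0, 1                     # states: battery level
SEARCH, WAIT, RECHARGE = 0, 1, 2     # actions

alpha, beta = 0.9, 0.6               # assumed transition parameters
R_search, R_wait = 2.0, 1.0          # assumed rewards

# T[s][a] is a list of (next_state, probability, reward) triples.
T = {
    HIGH: {
        SEARCH:   [(HIGH, alpha, R_search), (LOW, 1 - alpha, R_search)],
        WAIT:     [(HIGH, 1.0, R_wait)],
    },
    LOW: {
        SEARCH:   [(LOW, beta, R_search), (HIGH, 1 - beta, -3.0)],  # battery depleted: robot rescued, penalty
        WAIT:     [(LOW, 1.0, R_wait)],
        RECHARGE: [(HIGH, 1.0, 0.0)],
    },
}

def expected_reward(s, a):
    """Expected one-step reward: sum over s' of T(s,a,s') * R(s,a,s')."""
    return sum(p * r for (_, p, r) in T[s][a])

print(expected_reward(LOW, SEARCH))
```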

A deterministic policy for an MDP is a mapping $\pi : S \to A$. A stochastic policy is a mapping $\pi : S \to \Pi(A)$, where $\Pi(A)$ is a probability distribution over $A$. Stochastic policies are also called soft, because they do not commit to a single action per state. We write $\pi(s, a)$ for the probability that policy $\pi$ chooses action $a$ in state $s$. An example is the $(1-\varepsilon)$-soft policy, for some $0 < \varepsilon < 1$, which picks a particular action with probability $1-\varepsilon$ and picks an action at random with probability $\varepsilon$.

Finally, an optimal policy for an MDP is a policy that maximizes¹ the expected total reward over time. Formally,

$$\pi^* = \arg\max_\pi E_\pi(R) = \arg\max_\pi E_\pi\!\left[\sum_{t=0}^{h} \gamma^t R(t)\right]$$

where $h$ determines the horizon (how long the process runs) and $\gamma$ is the discount rate (which relates to the present value of future rewards). For episodic tasks the horizon is finite, $h < \infty$, and $\gamma \le 1$. For continuing tasks $h = \infty$ and $0 < \gamma < 1$.

¹ There is a similar definition for minimization problems.

Value Functions

Given an MDP, the state value function assigns a value to each state. The value $V^\pi(s)$ of a state $s$ under a policy $\pi$ is the expected return when starting in state $s$ and following $\pi$ thereafter. Formally,

$$V^\pi(s) = E_\pi(R_t \mid s_t = s)$$

Similarly, the state-action value function assigns a value to each (state, action) pair. The value $Q^\pi(s, a)$ of taking action $a$ in state $s$ under a policy $\pi$ is the expected return starting from $s$, taking $a$, and following $\pi$ thereafter:

$$Q^\pi(s, a) = E_\pi(R_t \mid s_t = s, a_t = a)$$

Notice that the state and the state-action value functions are related as follows:

$$V^\pi(s) = \sum_{a \in A} \pi(s, a)\, Q^\pi(s, a)$$

$$Q^\pi(s, a) = \sum_{s' \in S} T(s, a, s') \left[ R(s, a, s') + \gamma V^\pi(s') \right]$$

By substituting these expressions into each other we obtain the following recurrent equations for the value functions, known as the Bellman equations:

$$V^\pi(s) = \sum_{a \in A} \pi(s, a) \sum_{s' \in S} T(s, a, s') \left[ R(s, a, s') + \gamma V^\pi(s') \right]$$

$$Q^\pi(s, a) = \sum_{s' \in S} T(s, a, s') \left[ R(s, a, s') + \gamma \sum_{a' \in A} \pi(s', a')\, Q^\pi(s', a') \right]$$

It is easy to see that these equations are linear in $V^\pi(s)$ or $Q^\pi(s, a)$, and the exact values can be obtained by solving the resulting system of linear equations (a small sketch of this appears below).

The optimal value function is the one that corresponds to the optimal policy. In this case the value of each state or state-action pair is maximized:

$$V^*(s) = \max_\pi V^\pi(s) \qquad Q^*(s, a) = \max_\pi Q^\pi(s, a)$$

It has been proven that for any MDP there exists a deterministic optimal policy. Given that, we can derive an expression for the optimal value function from the equations above by always selecting (with probability 1) the action that maximizes the value. This results in the Bellman optimality equations:

$$V^*(s) = \max_{a \in A} \sum_{s' \in S} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$$

$$Q^*(s, a) = \sum_{s' \in S} T(s, a, s') \left[ R(s, a, s') + \gamma \max_{a' \in A} Q^*(s', a') \right]$$

Unfortunately, these are nonlinear equations and cannot be solved analytically. There is, however, a dynamic programming algorithm, known as value iteration, that iteratively approximates the optimal value function.

Why is the optimal value function so important? Simply because, given the optimal value function, we can easily derive the optimal policy for the MDP. It is known that the optimal policy is deterministic and greedy with respect to the optimal value function. That means that it picks only one action per state, namely the one that locally maximizes the value function. Formally, in state $s$, pick action $a^*$ such that:

$$a^* = \arg\max_{a \in A} \left\{ \sum_{s' \in S} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right] \right\}$$

$$a^* = \arg\max_{a \in A} Q^*(s, a)$$
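
Since the Bellman equations for a fixed policy are linear, policy evaluation can be carried out exactly with a single linear solve. The sketch below is a minimal illustration of this, assuming the transition probabilities and rewards are given as NumPy arrays T[s, a, s'] and R[s, a, s']; the array names and shapes are assumptions made for the example, not details from the paper.

```python
import numpy as np

def evaluate_policy(T, R, pi, gamma):
    """Exact policy evaluation: solve V = r_pi + gamma * P_pi V for V.

    T:  array of shape (S, A, S), T[s, a, s'] = transition probability
    R:  array of shape (S, A, S), R[s, a, s'] = expected reward
    pi: array of shape (S, A),    pi[s, a]    = probability of action a in state s
    """
    S = T.shape[0]
    # State-to-state transition matrix and expected one-step reward under pi.
    P_pi = np.einsum('sa,sat->st', pi, T)
    r_pi = np.einsum('sa,sat,sat->s', pi, T, R)
    # Solve the linear Bellman system (I - gamma * P_pi) V = r_pi.
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    # Recover the state-action values from V.
    Q = np.einsum('sat,sat->sa', T, R) + gamma * np.einsum('sat,t->sa', T, V)
    return V, Q
```

Note that for $\gamma = 1$ (as in the episodic tasks later in the paper) the linear system is only well-posed when every policy eventually reaches a terminal state.
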
The beauty of the value function is that it turns a difficult global optimization problem over the long term (finding the optimal policy) into a simple local search problem. Notice that given $V^*(s)$ we also need the environment dynamics ($T$ and $R$) in order to determine the optimal policy, whereas given $Q^*(s, a)$ we need no additional information. This is a crucial feature that will be exploited later for model-free learning. However, it comes at the expense of a more complicated (two-dimensional) function.

Solving MDPs

There is a number of methods to solve an MDP, i.e., to determine the optimal policy. We briefly mention four of them.

Value Iteration In this case, we try to find (or rather, approximate) the optimal value function by solving the Bellman optimality equations; the optimal policy can then easily be found. The idea is to start with an arbitrary value function and iteratively update the values using the Bellman optimality equations until there is no change. Eventually, the value function converges to the optimal one (a sketch of this update is given after these four methods). The cost is $O(|A||S|^2)$ per iteration, but the number of iterations can grow exponentially in the discount factor.

Policy Iteration In this case, we manipulate the policy directly. Start with an arbitrary policy, evaluate it (solve the linear Bellman equations), improve it (make it greedy), and repeat until there is no change in the policy (it converges to the optimal one). The cost in this case is $O(|A||S|^2 + |S|^3)$ per iteration, and the number of iterations is bounded by $|A|^{|S|}$, the total number of all possible distinct policies. Unfortunately, there is no proof that it can be that big!

Modified Policy Iteration This method combines the previous two. Instead of solving the Bellman equations exactly to evaluate an intermediate policy ($O(|S|^3)$), one can run a few steps of value iteration ($O(|A||S|^2)$ each) to merely approximate it, then improve the policy and repeat. It turns out that the method converges to the optimal policy and is very efficient in practice.

Linear Programming Finally, MDPs can be recast as Linear Programming problems. This is the only known polynomial-time approach, although it is not always efficient in practice.
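
As a concrete illustration of the value iteration method above, the following sketch repeatedly applies the Bellman optimality backup on tabular arrays. As before, the array layout (T[s, a, s'] and R[s, a, s']) and the stopping threshold are assumptions made for the example.

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-8):
    """Approximate V* by repeatedly applying the Bellman optimality backup.

    T, R: arrays of shape (S, A, S) with transition probabilities and rewards.
    """
    S = T.shape[0]
    V = np.zeros(S)
    while True:
        # Q[s, a] = sum over s' of T[s,a,s'] * (R[s,a,s'] + gamma * V[s'])
        Q = np.einsum('sat,sat->sa', T, R) + gamma * (T @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # optimal values and a greedy policy
        V = V_new
```

With $\gamma = 1$ the loop may not terminate unless every policy reaches a terminal state, so in practice a discount $\gamma < 1$ or an iteration cap is used.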

Reinforcement Learning

So far we have assumed that a model is given and the dynamics are known. Uncertainty, however, is inherent in the real world. Sensors are limited and/or inaccurate, measurements cannot be taken at all times, many systems are unpredictable, and so on. For many interesting problems the transition probabilities $T(s, a, s')$ or the reward function $R(s, a, s')$ are unavailable. Can we still make good decisions in such a case? On what basis?

Let's begin from what is available. We assume that we have complete observability of the state, that is, we know which state we are in with certainty². Besides, we can sample the unknown functions $T(s, a, s')$ and $R(s, a, s')$ through interaction with the environment. Based on that experience, i.e., reinforcements from the environment, can we learn to act optimally in the world? That leads us to the field of Reinforcement Learning (RL), which has recently received significant attention. It is better to think of RL as a class of problems rather than as a class of methods. These are exactly the problems where a complete, interactive, goal-seeking agent learns to act in its environment by trial and error.

² The cases where there is hidden state information are formulated as Partially Observable Markov Decision Processes (POMDPs). We do not discuss POMDPs here, but we consider a case with hidden state later.

There are several issues associated with RL problems. We focus on three of them:

Delayed Rewards In many problems, reward is received only after long periods. Consider chess or any similar game: the agent cannot be rewarded before the end of the game, because only then is the outcome known.

Credit Assignment Problem This problem has two aspects. Assuming that the agent receives a large reward (or punishment), which of the most recent (sequential) decisions was responsible for it (temporal credit assignment)? Also, which part of the system was responsible for it (structural credit assignment)?

Exploration and Exploitation Since the agent is not given information in advance, it has to learn by exploration. Exploration results in knowledge that can be exploited to gain more and more reward, but there might be better ways to get rewarded. So exploration and exploitation pull in different directions. How can the agent balance the trade-off between them?

Given the situation that an RL agent faces, there are two general ways to go:

Model-Based Approaches Learn a model of the process and use the model to derive an optimal policy. This corresponds to so-called indirect control.

Model-Free Approaches Learn a policy without learning a model. This corresponds to so-called direct control.

For this work we focus on model-free methods for RL problems. Such methods mostly proceed by trying to estimate (learn) the optimal state-action value function of the MDP.
If this is possible, we can figure out the optimal policy without a model. Two broad classes of such methods are the following:

Monte-Carlo (MC) Methods Monte-Carlo methods estimate the values by sampling total rewards over long runs starting from particular state-action pairs. Given enough data, this leads to a very good approximation. By using exploring starts or a stochastic policy, the agent maintains exploration during learning.

Temporal Difference (TD) Methods In this case, the agent estimates the values based on sample rewards and previous estimates. Thus, TD methods are fully incremental and learn on a step-by-step basis. Moreover, the algorithms known as TD($\lambda$) provide a way to add some MC flavor to a TD algorithm by propagating recent information through the most recent path in the state space using so-called eligibility traces.

In both cases, it is possible to learn the value function of one policy while following another. For example, we can learn the optimal value function while following an exploratory policy. This is called off-policy learning. When the two policies match, we talk about on-policy learning.

Q-Learning: An Off-Policy TD(0) Method

Q-learning (Watkins, 1989) is one of the best-known and most important RL algorithms. Q-learning attempts to learn the Q values of the state-action value function, hence the name. The idea is as follows. You are currently in state $s_t$. Take action $a_t$ under the current policy. Observe the new state $s_{t+1}$ and the immediate reward $r_{t+1}$. The previous estimate for $(s_t, a_t)$ was $Q(s_t, a_t)$. The experienced estimate for $(s_t, a_t)$ is

$$r_{t+1} + \gamma \max_a Q(s_{t+1}, a)$$

The new estimate for $(s_t, a_t)$ is a convex combination of the two:

$$Q^{(t+1)}(s_t, a_t) = (1 - \alpha)\, Q^{(t)}(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q^{(t)}(s_{t+1}, a) \right]$$

where $\alpha$ ($0 < \alpha < 1$) is the learning rate.
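
A minimal sketch of this update in tabular form is given below. The $\varepsilon$-greedy action selection and the environment interface (an env object with reset() and step(a) returning the next state, the reward, and a done flag) are assumptions of the example rather than details from the paper.

```python
import numpy as np

def q_learning_episode(env, Q, alpha=0.5, gamma=1.0, epsilon=0.1,
                       rng=np.random.default_rng()):
    """Run one episode of tabular Q-learning, updating Q in place.

    Q is an array of shape (num_states, num_actions).
    env is assumed to provide reset() -> s and step(a) -> (s', r, done).
    """
    s = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection (exploration vs. exploitation)
        if rng.random() < epsilon:
            a = int(rng.integers(Q.shape[1]))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)
        # Q-learning backup: convex combination of old estimate and experienced estimate
        target = r if done else r + gamma * np.max(Q[s_next])
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
        s = s_next
    return Q
```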

Generalization and Function Approximation

Finally, it should be mentioned that there are different ways to represent the value functions, and the success or failure of a value-function-based RL method can depend on this choice. The naive way is to use a (huge) table to store the values $Q(s, a)$. The advantage is that the table can represent any function and convergence is guaranteed. The downside, on the other hand, is slow learning. Besides, the curse of dimensionality (exponential growth of the state space) and the restriction to discrete spaces can be severe handicaps of the tabular representation. Another way is to use a function approximator to represent $Q(s, a)$. The intuition is that similar entries $(s, a)$ should have similar values $Q(s, a)$. Depending on the definition of similarity and the generalization capabilities of the function approximator used, this representation can be a big win. Choices for a function approximator include, but are not limited to, state aggregation, polynomial regression, neural networks, etc. The advantages in this case are rapid adaptation, handling of continuous spaces, and, of course, the generalization ability. The main drawback is that convergence is no longer guaranteed.

The Pendulum on the Cart

Figure 2 depicts the system we study in this paper. A moving cart carries a pendulum that can swing freely around its pivot. Assume that the system is built in such a way that the pendulum can complete a full circle.

Figure 2: The Pendulum on the Cart

The system is described by the following state equations:

$$\dot{x}_1 = x_2$$

$$\dot{x}_2 = \frac{g \sin(x_1) - a m l x_2^2 \sin(2 x_1)/2 - a \cos(x_1)\, u}{4l/3 - a m l \cos^2(x_1)}$$

where the variables and parameters are as follows:
- $x_1$: angle, $x_2$: angular velocity
- $g$: gravity constant ($g = 9.8\ \mathrm{m/s^2}$)
- $m$: mass of the pendulum ($m = 2.0\ \mathrm{kg}$)
- $M$: mass of the cart ($M = 8.0\ \mathrm{kg}$)
- $l$: length of the pendulum
- $u$: force applied to the cart (in Newtons)
- $a = 1/(m + M)$

It is easy to see that this is a nonlinear system. Controlling such a system to achieve a certain behavior or reach a certain state is not a trivial task. For our purposes, we ignore the model of the system; if the model were known, there would be no need to apply learning, since several powerful techniques exist (see, for example, (Wang, et al., 1996)) that can solve the problem optimally. So, for this work, the model serves only as the simulator of the system dynamics, given the lack of a real cart with a pendulum.

Control Tasks

We consider two different tasks for our controller. Both are episodic tasks, meaning that each episode eventually comes to an end (either because some time limit has passed or because some event has happened). In this section, we describe the set-up of the tasks as MDPs and the RL methods we used.

Task 1 At the beginning of the episode the pendulum is initialized in the downward position with zero angular velocity. The state information given to the system is a discrete value of the angle (63 values in total) and the sign of the angular velocity. Note that the magnitude of the angular velocity is unknown (hidden state). The controller has five actions available, corresponding to two negative forces, zero force, and two positive forces. These forces are chosen small enough so that the only way to move the pendulum higher and higher is by swinging it back and forth. The objective is to find a policy that will drive the pendulum past the upward position (at any velocity) within a maximum period of 60 seconds. The system is rewarded by $-|\hat{\theta}|$, where $\hat{\theta}$ is the smallest (discrete) angle from the upward position. Obviously, the reward is proportional to the (angular) distance from the goal, and it is maximal (0) at the upward position. Control actions are taken at discrete time steps and the force $u$ remains constant between steps.
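
For reference, a minimal simulation of the state equations above is sketched below. The Euler integration scheme, the time step, and the value of the pendulum length $l$ are assumptions made for the example (the paper only states that the model was provided as MATLAB code).

```python
import numpy as np

# Parameters from the paper; the pendulum length l is an assumed example value.
g, m, M = 9.8, 2.0, 8.0
a = 1.0 / (m + M)
l = 0.5  # assumed

def pendulum_derivatives(x, u):
    """State equations of the pendulum on the cart: x = (angle, angular velocity)."""
    x1, x2 = x
    num = g * np.sin(x1) - a * m * l * x2**2 * np.sin(2 * x1) / 2 - a * np.cos(x1) * u
    den = 4 * l / 3 - a * m * l * np.cos(x1)**2
    return np.array([x2, num / den])

def simulate_step(x, u, dt=0.1, substeps=10):
    """Hold force u constant for dt seconds, integrating with simple Euler substeps."""
    h = dt / substeps
    for _ in range(substeps):
        x = x + h * pendulum_derivatives(x, u)
    return x
```
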
Task 2 At the beginning of the episode the pendulum is initialized in the upward position, perturbed by a small angle, with zero angular velocity. The state information given to the system is a discrete value of the angle (63 values in total) and a discretization of the angular velocity in the range $[-1, 1]$ (21 values in total). The controller has five actions available, corresponding to two negative forces, zero force, and two positive forces. These forces are chosen big enough so that they can resist the force due to the weight of the pendulum. The objective is to find a policy that will balance the pendulum close to the upward position and close to a zero angular velocity for a period of at least 10 seconds. The system is rewarded with 0 for being in the state corresponding to the upward position and zero velocity, or when the period of 10 seconds expires without a failure. If the pendulum falls beyond the downward position or the angular velocity exceeds the limit of 1 in absolute value, the system receives a large negative reward. In all other cases a constant reward of -1 is given. Control actions are taken at discrete time steps (every 0.1 seconds, for better control) and the force $u$ remains constant between steps.
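
The sketch below illustrates one possible way to discretize the angle and compute the Task 1 reward described above. The number of angle values (63) comes from the paper, but the uniform binning, the wrap-around convention, and the choice of 0 as the upward angle are assumptions of this sketch.

```python
import numpy as np

N_ANGLE = 63   # discrete angle values (from the paper)

def discretize_angle(x1):
    """Map a continuous angle to one of N_ANGLE discrete values.

    The angle is wrapped to [-pi, pi), with 0 assumed to be the upward position;
    the uniform binning used here is an assumption of this sketch.
    """
    wrapped = (x1 + np.pi) % (2 * np.pi) - np.pi
    return int(np.clip((wrapped + np.pi) / (2 * np.pi) * N_ANGLE, 0, N_ANGLE - 1))

def task1_reward(x1):
    """Task 1 reward: minus the smallest discrete angular distance from upward."""
    k = discretize_angle(x1)
    upward = discretize_angle(0.0)
    d = abs(k - upward)
    return -min(d, N_ANGLE - d)   # wrap-around distance on the circle of bins
```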

Implementation

The system (simulator and controller) was implemented in MATLAB. The learning algorithm is Q($\lambda$), a variant of Q-learning that uses eligibility traces to propagate information to all of the most recently visited state-action values at once. A learning rate between 0.4 and 0.5 was used, with an exploration probability between 0.1 and 0.3. Long traces ($\lambda$ of about 0.7 to 0.9) were used, and no discounting ($\gamma = 1$), since the tasks are episodic. Given that the state space is relatively small, a table-based representation is used to store the Q values.

Results

For each task, we provide three diagrams: the state variables over time, the trajectory in state space, and the control force $u$ over time. Although the learning time was not measured precisely, it was between 4 and 6 minutes.

Task 1 At the very beginning the system behaves in a more-or-less random way, trying to figure out what to do. After some hundreds of trials the goal is achieved, as shown in Figure 3. After some more hundreds of trials and sufficient exploration the controller improves, and the task is completed faster, as shown in Figure 4.

Task 2 Again, at the beginning the behavior is rather erratic, but after many trials the controller manages to keep the pendulum inverted, close to the upward position, as shown in Figure 5. Figure 5 is a blow-up of the state space, but it is clear from Figure 6 that the pendulum stays inverted. Finally, increasing the episode period to 60 seconds did not affect the controller, which was able to keep control of the pendulum for the entire minute (see Figure 7).

Conclusion and Future Work

This work, among others, demonstrates that there is a lot of potential in learning approaches to control. However, more must be done before this technology becomes widely useful. Concerning this particular work, I am mostly interested in finding ways to accelerate learning and make the controller adapt faster; function approximation might be the choice here. I am also interested in applying RL methods to a real pendulum-cart system and studying the differences (and surprises) between the real world and simulation.

Acknowledgments

I would like to thank Prof. Wang for providing the MATLAB code of the model.

Selected Bibliography

1. Sutton, R. S. and Barto, A. G. 1998. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.
2. Wang, H., Tanaka, K., and Griffin, M. 1996. An Approach to Fuzzy Control of Nonlinear Systems: Stability and Design Issues. IEEE Transactions on Fuzzy Systems, 4(1), pp. 14-23.
3. Watkins, C. 1989. Learning from Delayed Rewards. Ph.D. Thesis, Cambridge University.

Figure 3: Successful completion of Task 1.
Figure 4: Better completion of Task 1.

Figure 5: Successful completion of Task 2.
Figure 6: Successful completion of Task 2.

Figure 7: Even better completion of Task 2.