Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm

Michail G. Lagoudakis
Department of Computer Science, Duke University, Durham, NC 27708
mgl@cs.duke.edu

Abstract

We consider the problem of controlling and balancing a freely-swinging pendulum on a moving cart. We assume that no model of the nonlinear system is available. We model the problem as a Markov Decision Process and draw techniques from the field of Reinforcement Learning to implement a learning controller. Although slow, the learning process is demonstrated (in simulation) on two challenging control tasks.

Introduction

Nonlinear control has become a growing field of study. Although its techniques and methods fall well behind the mature field of linear control, there is much interest in developing nonlinear control methods that go beyond linearization. This is probably due to the fact that most phenomena in the real world are inherently nonlinear, and linear methods cannot fully describe them. On the other hand, there is always a need for flexible and adaptive controllers that can operate over a wide range and under uncertain conditions, which linear controllers cannot capture. To this end, a new technology and a whole field, namely that of learning systems, has developed during the last decades. Adaptive control, artificial neural networks, fuzzy logic, machine learning, intelligent control, and learning robotics can all be placed under this umbrella. Here we focus on one such trend that goes by the name of reinforcement learning. In this paper, we propose two control tasks on a nonlinear system for which no model is assumed to be given, and we develop learning controllers that achieve the tasks by trial-and-error. Our experience is that although this process can be time-consuming, it can be very successful.
We begin by providing the necessary background for this work, Markov Decision Processes and Reinforcement Learning, looking more closely at the components used in this paper, such as the Q-learning algorithm. We then present the nonlinear system, a pendulum on a moving cart, as well as the model that simulates its dynamics. Subsequently, we define the two control tasks and provide all the details of setting up the problem as a learning problem. The experimental results demonstrate that our learning system successfully completes both tasks in reasonable time, exhibiting reasonable performance.

Markov Decision Processes

A Markov Decision Process (MDP) consists of:

- a finite set of states S
- a finite set of actions A
- a state transition function T : S x A -> Π(S), where Π(S) is a probability distribution over S and T(s, a, s') is the probability of making the transition from s to s' under action a
- a reward function R : S x A x S -> R, where R(s, a, s') is the (expected) reward for making the transition from s to s' under action a.

Many real-world problems (especially in the Operations Research area) can be formulated under this framework. The distinguishing characteristic is the so-called Markov property, namely the property that state transitions are independent of any previous states and/or actions. We can use a transition graph to visualize an MDP. Figure 1 shows such a graph for an MDP with 2 states (battery level: LOW and HIGH) and 3 actions (robot: Wait, Search, Recharge). Transition probabilities and rewards are shown on the edges. The task here refers to a recycling robot that can choose among searching for an empty can, waiting, or recharging its battery, depending on its battery level.

Figure 1: The Recycling Robot (Sutton and Barto, 1998)
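As a concrete illustration, a finite MDP like the recycling robot can be written down directly as a transition table. The sketch below is in Python rather than the MATLAB used later in the paper, and the states, probabilities, and rewards are made-up placeholders, not the actual values of the Sutton and Barto example:

```python
# A minimal finite-MDP container. T[(s, a)] is a list of
# (next_state, probability, reward) triples. All numbers are illustrative.
T = {
    ("high", "search"):   [("high", 0.7, 2.0), ("low", 0.3, 2.0)],
    ("high", "wait"):     [("high", 1.0, 1.0)],
    ("low",  "search"):   [("low", 0.6, 2.0), ("high", 0.4, -3.0)],
    ("low",  "wait"):     [("low", 1.0, 1.0)],
    ("low",  "recharge"): [("high", 1.0, 0.0)],
}

def check_transition_model(T):
    """Each (state, action) pair must define a probability distribution."""
    for (s, a), outcomes in T.items():
        total = sum(p for _, p, _ in outcomes)
        assert abs(total - 1.0) < 1e-9, (s, a)

check_transition_model(T)
print("transition model is well-formed")
```

The Markov property is implicit in this representation: the outcome distribution depends only on the current state and action, never on the history.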
A deterministic policy for an MDP is a mapping π : S -> A. A stochastic policy is a mapping π : S -> Π(A), where Π(A) is a probability distribution over A. Stochastic policies are also called soft, for they do not commit to a single action per state. We use π(s, a) for the probability that policy π chooses action a in state s. An example is the ε-soft policy, for some 0 < ε < 1, which picks a particular action with probability 1 - ε and picks an action at random with probability ε. Finally, an optimal policy π* for an MDP is a policy that maximizes¹ the expected total reward over time. Formally:

  π* = argmax_π E_π(R) = argmax_π E_π [ Σ_{t=0..h} γ^t R(t) ]

where h determines the horizon (how long the process runs) and γ is the discount rate (which relates to the present value of future rewards). For episodic tasks the horizon is finite, h < ∞, and γ ≤ 1. For continuing tasks h = ∞ and 0 ≤ γ < 1.

Value Functions

Given an MDP, the state value function assigns a value to each state. The value V^π(s) of a state s under a policy π is the expected return when starting in state s and following π thereafter. Formally,

  V^π(s) = E_π(R_t | s_t = s)

Similarly, the state-action value function assigns a value to each (state, action) pair. The value Q^π(s, a) of taking action a in state s under a policy π is the expected return starting from s, taking a, and following π thereafter:

  Q^π(s, a) = E_π(R_t | s_t = s, a_t = a)

Notice that the state and the state-action value functions are related as follows:

  V^π(s) = Σ_{a∈A} π(s, a) Q^π(s, a)

  Q^π(s, a) = Σ_{s'∈S} T(s, a, s') [ R(s, a, s') + γ V^π(s') ]

By substituting these expressions into each other we obtain the following recursive equations for the value functions, known as the Bellman equations:

  V^π(s) = Σ_{a∈A} π(s, a) Σ_{s'∈S} T(s, a, s') [ R(s, a, s') + γ V^π(s') ]

  Q^π(s, a) = Σ_{s'∈S} T(s, a, s') [ R(s, a, s') + γ Σ_{a'∈A} π(s', a') Q^π(s', a') ]

It is easy to see that these equations are linear in V^π(s) or Q^π(s, a), and the exact values can be obtained by solving the resulting system of linear equations.
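To make policy evaluation concrete, here is a small sketch in Python with a made-up two-state MDP. Rather than calling a linear-algebra routine, it applies the linear Bellman update for a fixed policy repeatedly, which converges to the same solution V^π of the system of equations:

```python
# Iterative policy evaluation: repeatedly apply the Bellman equation for a
# fixed policy until the values stop changing. The MDP below is hypothetical;
# gamma < 1 guarantees convergence to V^pi.
GAMMA = 0.9

# T[s][a] -> list of (next_state, probability, reward) triples.
T = {
    0: {"stay": [(0, 1.0, 1.0)], "go": [(1, 0.8, 0.0), (0, 0.2, 0.0)]},
    1: {"stay": [(1, 1.0, 2.0)], "go": [(0, 1.0, 0.0)]},
}
# A stochastic policy pi(s, a).
pi = {0: {"stay": 0.5, "go": 0.5}, 1: {"stay": 1.0, "go": 0.0}}

def evaluate(pi, T, gamma=GAMMA, tol=1e-10):
    V = {s: 0.0 for s in T}
    while True:
        delta = 0.0
        for s in T:
            v = sum(pi[s][a] * sum(p * (r + gamma * V[s2])
                                   for s2, p, r in T[s][a])
                    for a in T[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

V = evaluate(pi, T)
# The result satisfies the Bellman equation: the residual should be ~0.
for s in T:
    rhs = sum(pi[s][a] * sum(p * (r + GAMMA * V[s2]) for s2, p, r in T[s][a])
              for a in T[s])
    assert abs(V[s] - rhs) < 1e-8
print("V =", {s: round(v, 3) for s, v in V.items()})
```

In state 1 this policy always plays "stay" and earns 2 per step, so V(1) solves V = 2 + 0.9 V, i.e. V(1) = 20 exactly, which the iteration recovers.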
The optimal value function is the one that corresponds to the optimal policy. In this case, the value of each state or state-action pair is maximized:

  V*(s) = max_π V^π(s)        Q*(s, a) = max_π Q^π(s, a)

It has been proven that for any MDP there exists a deterministic optimal policy. Given that, we can derive an expression for the optimal value function from the equations above by always selecting (with probability 1) the action that maximizes the value. This results in the Bellman optimality equations:

  V*(s) = max_{a∈A} Σ_{s'∈S} T(s, a, s') [ R(s, a, s') + γ V*(s') ]

  Q*(s, a) = Σ_{s'∈S} T(s, a, s') [ R(s, a, s') + γ max_{a'∈A} Q*(s', a') ]

Unfortunately, these are nonlinear equations and cannot be solved analytically. There is, however, a dynamic programming algorithm, known as value iteration, that iteratively approximates the optimal value function.

Why is the optimal value function so important? Simply because, given the optimal value function, we can easily derive the optimal policy for the MDP. It is known that the optimal policy is deterministic and greedy with respect to the optimal value function. That means that it picks only one action per state, namely the one that locally maximizes the value function. Formally, in state s, pick action a*, such that:

  a* = argmax_{a∈A} Σ_{s'∈S} T(s, a, s') [ R(s, a, s') + γ V*(s') ]

  a* = argmax_{a∈A} Q*(s, a)

The beauty of the value function is that it turns a difficult global optimization problem over the long term (finding the optimal policy) into a simple local search problem. Notice that given V*(s) we need the environment dynamics (T and R) in order to determine the optimal policy, whereas given Q*(s, a) we need no additional information. This is a crucial feature that will be exploited later for model-free learning. However, it comes at the expense of a more complicated (two-dimensional) function.

¹ There is a similar definition for minimization problems.

Solving MDPs

There are a number of methods to solve an MDP, i.e.
to determine the optimal policy. We briefly mention four of them.

Value Iteration

In this case, we try to find (or rather, approximate) the optimal value function by solving the Bellman optimality equations; the optimal policy can then easily be found. The idea is to start with an arbitrary value function and iteratively update the values using the Bellman optimality equations until there is no change. Eventually, the value function converges to the optimal one. The cost is O(|A||S|²) per iteration, but the number of iterations can grow exponentially in the discount factor.

Policy Iteration

In this case, we manipulate the policy directly. Start with an arbitrary policy, evaluate it (solve the linear Bellman equations), improve it (make it greedy), and repeat until there is no change in the policy (it converges to the optimal one). The cost in this case is O(|A||S|² + |S|³) per iteration, but the number of iterations is bounded by |A|^|S|,
the total number of all possible distinct policies. Fortunately, in practice it never seems to get anywhere near that big.

Modified Policy Iteration

This method combines the previous two. Instead of solving the Bellman equations exactly to evaluate an intermediate policy (O(|S|³)), one can run a few steps of value iteration (O(|A||S|²) each) just to approximate it, then improve the policy and repeat. It turns out that the method converges to the optimal policy and is very efficient in practice.

Linear Programming

Finally, MDPs can be recast as Linear Programming problems. This is the only known polynomial-time approach, although it is not always efficient in practice.

Reinforcement Learning

So far we have assumed that a model is given and the dynamics are known. Uncertainty, however, is inherent in the real world. Sensors are limited and/or inaccurate, measurements cannot be taken at all times, many systems are unpredictable, and so on. For many interesting problems, the transition probabilities T(s, a, s') or the reward function R(s, a, s') are unavailable. Can we still make good decisions in such a case? On what basis?

Let's begin with what is available. We assume that we have complete observability of the state, that is, we know the state we are in with certainty². Moreover, we can sample the unknown functions T(s, a, s') and R(s, a, s') through interaction with the environment. Based on that experience, i.e. reinforcements from the environment, can we learn to act optimally in the world? This leads us to the field of Reinforcement Learning (RL), which has recently received significant attention. It is better to think of RL as a class of problems, rather than as a class of methods. These are exactly the problems where a complete, interactive, goal-seeking agent learns to act in its environment by trial-and-error. There are several issues associated with RL problems.
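Looking back at the solution methods just described, value iteration is compact enough to sketch in full. The toy MDP below is hypothetical; the sweep applies the Bellman optimality update until convergence and then reads off the greedy policy:

```python
# Value iteration: sweep the Bellman optimality update until the value
# function stops changing, then extract the greedy policy. The transition
# model T[s][a] -> [(next_state, probability, reward), ...] is made up.
GAMMA = 0.9
T = {
    0: {"stay": [(0, 1.0, 1.0)], "go": [(1, 0.8, 0.0), (0, 0.2, 0.0)]},
    1: {"stay": [(1, 1.0, 2.0)], "go": [(0, 1.0, 0.0)]},
}

def backup(V, s, a, gamma=GAMMA):
    """One-step lookahead value of action a in state s."""
    return sum(p * (r + gamma * V[s2]) for s2, p, r in T[s][a])

def value_iteration(T, gamma=GAMMA, tol=1e-10):
    V = {s: 0.0 for s in T}
    while True:
        delta = 0.0
        for s in T:
            best = max(backup(V, s, a, gamma) for a in T[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # The optimal policy is greedy with respect to the optimal value function.
    policy = {s: max(T[s], key=lambda a: backup(V, s, a, gamma)) for s in T}
    return V, policy

V, policy = value_iteration(T)
print(policy)  # -> {0: 'go', 1: 'stay'}
```

Note that extracting the greedy policy here needs the model (T and R); a learned Q function, as discussed later, removes that requirement.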
We focus on three of them.

Delayed Rewards

In many problems, reward is received only after long periods. Consider chess or any similar game: the agent cannot be rewarded before the end of the game, for only then is the outcome known.

Credit Assignment Problem

This problem has two aspects. Assuming that the agent receives a large amount of reward (or punishment), which of the most recent (sequential) decisions was responsible for it (temporal credit assignment)? Also, what part of the system was responsible for it (structural credit assignment)?

Exploration and Exploitation

Since the agent is not given information in advance, it has to learn by exploration. Exploration results in knowledge that can be exploited to gain more and more reward. But there might be better ways to get rewarded, so exploration and exploitation pull in different directions. How can the agent balance the trade-off between them?

² The cases where there is hidden state information are formulated as Partially Observable Markov Decision Processes (POMDPs). We do not discuss POMDPs here, but we consider a case with hidden state later.

Given the situation that an RL agent faces, there are two general ways to go:

Model-Based Approaches: Learn a model of the process and use the model to derive an optimal policy. This corresponds to so-called indirect control.

Model-Free Approaches: Learn a policy without learning a model. This corresponds to so-called direct control.

For this work we focus on model-free methods for RL problems. Such methods mostly proceed by trying to estimate (learn) the optimal state-action value function of the MDP. If this is possible, we can figure out the optimal policy without a model. Two broad classes of such methods are the following:

Monte-Carlo (MC) Methods: Monte-Carlo methods estimate the values by sampling total rewards over long runs starting from particular state-action pairs. Given enough data, this leads to a very good approximation.
By using exploring starts or a stochastic policy, the agent maintains exploration during learning.

Temporal Difference (TD) Methods: In this case, the agent estimates the values based on sample rewards and previous estimates. Thus, TD methods are fully incremental and learn on a step-by-step basis. Moreover, the algorithms known as TD(λ) provide a way to add some MC flavor to a TD algorithm by propagating recent information along the most recent path in the state space, using so-called eligibility traces.

In both cases, it is possible to learn the value function of one policy while following another. For example, we can learn the optimal value function while following an exploratory policy. This is called off-policy learning. When the two policies match, we talk about on-policy learning.

Q-Learning: An Off-Policy TD(0) Method

Q-learning (Watkins, 1989) is one of the best-known and most important RL algorithms. Q-learning attempts to learn the Q values of the state-action value function, hence the name. The idea is as follows. You are currently in state s_t. Take action a_t under the current policy. Observe the new state s_{t+1} and the immediate reward r_{t+1}. The previous estimate for (s_t, a_t) was Q(s_t, a_t). The experienced estimate for (s_t, a_t) is

  r_{t+1} + γ max_a Q(s_{t+1}, a)

The new estimate for (s_t, a_t) is a convex combination of the two:

  Q^(t+1)(s_t, a_t) = (1 - α) Q^(t)(s_t, a_t) + α [ r_{t+1} + γ max_a Q^(t)(s_{t+1}, a) ]
where α (0 < α ≤ 1) is the learning rate.

Generalization and Function Approximation

Finally, it should be mentioned that there are different ways to represent the value functions, and the success or failure of a value-function-based RL method can depend on that. The naive way is to use a (huge) table to store the values Q(s, a). The upshot is that the table can represent any function and convergence is guaranteed. The downside, on the other hand, is slow learning. Besides that, the curse of dimensionality (exponential growth of the state space) and the handling of discrete spaces only can be severe handicaps of the tabular representation.

Another way is to use a function approximator to represent Q(s, a). The intuition is that similar entries (s, a) should have similar values Q(s, a). Depending on the definition of similarity and the generalization capabilities of the function approximator used, this representation can be a big win. Choices for a function approximator include, but are not limited to, state aggregation, polynomial regression, neural networks, etc. The upshot in this case is rapid adaptation, handling of continuous spaces, and, of course, the generalization ability. The main drawback is that convergence is no longer guaranteed.

The Pendulum on the Cart

Figure 2 describes the system we study in this paper. A moving cart carries a pendulum that can swing freely around its origin. Assume that the system is built in such a way that the pendulum can complete a full circle.

Figure 2: The Pendulum on the Cart

The system is described by the following state equations:

  dx1/dt = x2

  dx2/dt = [ g sin(x1) - a m l x2² sin(2 x1)/2 - a cos(x1) u ] / [ 4l/3 - a m l cos²(x1) ]

where the variables and parameters are as follows:

  x1 : angle
  x2 : angular velocity
  g : gravity constant (g = 9.8 m/s²)
  m : mass of the pendulum (m = 2.0 kg)
  M : mass of the cart (M = 8.0 kg)
  l : length of the pendulum
  u : force applied to the cart (in Newtons)
  a : a = 1/(m + M)

It's easy to see that this is a nonlinear system.
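As a sketch of how such a system can be simulated, and how the Q-learning update above is applied on a sampled transition, consider the Python fragment below. The pendulum length l, the step sizes, the force set, and the reward shaping are illustrative assumptions, not necessarily the values used in the experiments:

```python
# Euler simulation of the pendulum-on-a-cart state equations, plus one
# tabular Q-learning update on a sampled transition.
import math
import random

g, m, M = 9.8, 2.0, 8.0          # gravity, pendulum mass, cart mass
a = 1.0 / (m + M)
l = 0.5                          # pendulum length (assumed, not given above)
dt = 0.01                        # integration step (assumed)

def step(x1, x2, u):
    """One Euler step of the state equations: angle x1, velocity x2, force u."""
    x2_dot = (g * math.sin(x1)
              - a * m * l * x2 ** 2 * math.sin(2 * x1) / 2
              - a * math.cos(x1) * u) / (4 * l / 3 - a * m * l * math.cos(x1) ** 2)
    return x1 + dt * x2, x2 + dt * x2_dot

def discretize(x1, n=63):
    """Map an angle to one of n discrete values, as in the task set-up below."""
    return int(((x1 + math.pi) % (2 * math.pi)) / (2 * math.pi) * n) % n

# One Q-learning update: Q(s,a) <- (1-alpha) Q(s,a) + alpha (r + gamma max Q(s',.))
actions = [-5.0, -1.0, 0.0, 1.0, 5.0]    # candidate forces (assumed)
Q = {}                                    # tabular representation
alpha, gamma = 0.5, 1.0                   # gamma = 1 for an episodic task

x1, x2 = math.pi, 0.0                     # start hanging straight down
u = random.choice(actions)
nx1, nx2 = step(x1, x2, u)
s, s2 = discretize(x1), discretize(nx1)
r = -abs(((nx1 + math.pi) % (2 * math.pi)) - math.pi)   # -|angle from upright|
target = r + gamma * max(Q.get((s2, b), 0.0) for b in actions)
Q[(s, u)] = (1 - alpha) * Q.get((s, u), 0.0) + alpha * target
print("updated Q(s,u) =", Q[(s, u)])
```

Repeating this update along whole trajectories, with an exploratory action choice, is the essence of the learning controller used in this work; the eligibility-trace variant discussed later simply spreads each update over recently visited pairs.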
Controlling such a system to achieve a certain behavior or reach a certain state is not a trivial task. For our purposes, we ignore the model of the system; if the model were known, there would be no need to apply learning, as there exist several powerful techniques (see, for example, (Wang, et al., 1996)) that can solve the problem optimally. So, for this work, the model serves only as the simulator of the system dynamics, given the lack of a real cart with a pendulum.

Control Tasks

We consider two different tasks for our controller. Both are episodic tasks, meaning that they eventually come to an end (either because some time limit has passed or because some event has happened). In this section, we describe the set-up of the tasks as MDPs and the RL methods we used.

Task 1

At the beginning of the episode the pendulum is initialized in the downward position (angle = π) with zero angular velocity. The state information given to the system is a discrete value of the angle (63 values in total) and the sign of the angular velocity. Note that the magnitude of the angular velocity is unknown (hidden state). The controller has five actions available, corresponding to forces of -5, -1, 0, +1, and +5 Newtons. These actions are chosen small enough that the only way to move the pendulum higher and higher is by swinging back and forth. The objective is to find a policy that will drive the pendulum past the upward position (at any velocity) within a maximum period of 60 seconds. The system is rewarded by -|θ|, where θ is the smallest (discrete) angle from the upward position. Thus, the reward is proportional to the (angular) distance from the goal, and it is maximized (at 0) at the upward position. Control actions are taken at discrete time steps (every 0.5 seconds) and the force remains constant between steps.

Task 2

At the beginning of the episode the pendulum is initialized in the upward position with some small perturbation of the angle and with zero angular velocity.
The state information given to the system is a discrete value of the angle (63 values in total) and a discretization of the angular velocity in the range [-10, 10] (21 values in total). The controller has five actions available, corresponding to forces of -50, -10, 0, +10, and +50 Newtons. These actions are chosen big enough that they can resist the force due to the weight of the pendulum. The objective is to find a policy that will balance the pendulum close to the upward position, with close to zero angular velocity, for a period of at least 10 seconds. The system is rewarded with 0 for being in the state corresponding to the upward position
and zero velocity, or when the period of 10 seconds expires without a failure. If the pendulum falls beyond the downward position, or the angular velocity exceeds the limit of 10 in absolute value, the system receives a reward of -100. In all other cases a constant reward of -1 is given. Control actions are taken at discrete time steps (every 0.1 seconds, for better control) and the force remains constant between steps.

Implementation

The system (simulator and controller) was implemented in MATLAB. The learning algorithm is Q(λ), a variant of Q-learning that uses eligibility traces to propagate information to the most recently visited state-action values at once. A learning rate of 0.4-0.5 was used, with an exploration probability between 0.1 and 0.3. Long traces (λ of about 0.7-0.9) were used, and no discounting (γ = 1), since the tasks are episodic. Given that the state space is relatively small, a table-based representation is used to store the Q values.

Acknowledgments

I would like to thank Prof. Wang for providing the MATLAB code of the model.

Selected Bibliography

1. Sutton, R. and Barto, A. 1998. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.

2. Wang, H., Tanaka, K., and Griffin, M. 1996. An Approach to Fuzzy Control of Nonlinear Systems: Stability and Design Issues. IEEE Transactions on Fuzzy Systems, 4(1):14-23.

3. Watkins, C. 1989. Learning from Delayed Rewards. Ph.D. Thesis, Cambridge University.

Results

For each case, we provide three diagrams: the state variables over time, the trajectory in state space, and the force over time. Although learning time was not measured precisely, it was between 4 and 6 minutes.

Task 1

At the very beginning the system behaves in a more-or-less random way, trying to figure out what to do. After some hundreds of trials the goal is achieved, as shown in Figure 3.
After some more hundreds of trials and sufficient exploration, the controller improves and the task is completed faster, as shown in Figure 4.

Task 2

Again, at the beginning the behavior is rather erratic, but after lots of trials the controller manages to keep the pendulum inverted, close to the upward position, as shown in Figure 5. Figure 5 is a blow-up of the state space, but it is clear from Figure 6 that the pendulum stays inverted. Finally, increasing the episode period to 60 seconds did not affect the controller, which was able to keep control of the pendulum for the entire minute (see Figure 7).

Conclusion and Future Work

This work and others demonstrate that there is a lot of potential in learning approaches to control. However, there is more to be done before this technology becomes widely useful. Concerning this particular work, I am mostly interested in figuring out ways to accelerate learning and make the controller adapt faster; function approximation might be the choice here. I am also interested in applying RL methods to a real pendulum-cart system and studying the differences (and surprises) between the real world and simulation.
Figure 3: Successful completion of task 1.

Figure 4: Better completion of task 1.
Figure 5: Successful completion of task 2.

Figure 6: Successful completion of task 2.
Figure 7: Even better completion of task 2.