The Nature of Learning - A study of Reinforcement Learning Methodology


The Nature of Learning - A study of Reinforcement Learning Methodology

Anders Christian Kølle
Department of Computer Science, University of Copenhagen
November 21, 23
andersk@diku.dk

One needs to explore to learn...

Abstract

This project is about the exciting field within adaptive computation and machine learning called reinforcement learning (RL). I will address theoretical and practical aspects of RL and try to answer some of the core questions that arise when trying to learn. How do we define optimal behavior? How do we learn the environment dynamics? How do we effectively gather instructive experience? I will look into the trade-off between exploration and exploitation in order to understand how to predict and control simultaneously. In addition, an exploration strategy category called directed exploration, a kind of learning to learn, will be discussed. I have experimented with different reward functions and discount factor values. Here I found an initial side-effect of the learning rate α, which can prove valuable for practical implementations. I further experimented with the differences between on-policy and off-policy temporal difference learning and found some problematic aspects of Sutton and Barto's classical cliff-walking task. Following this, more advanced learning methods that use experience more efficiently will be explored. Here I will address the Credit Assignment Problem and experiment with eligibility traces. I will propose a novel learning method called TD(λ, θ), which is based on a sigmoidal credit assignment distribution function that advances the flexibility of credit assignment in relation to Sutton's TD(λ) method. Finally, I take a look into the simulation of complex learning,1 where an individual learns by operating on mental representations of the world, a kind of reflection on one's experience. Within RL this is called model-based learning.

I have built a grid world based learning simulator, where I can construct control problems with different configurations. The simulator has eight integrated exploration strategies, two one-step learning algorithms, four trace based learning algorithms, and two model based learning algorithms. All methods and basic parameters can be configured using the simulator's graphical interface. The simulator has several predefined problems in a problem library, including all the problems used for experiments in this project. I suggest the reader tries to experiment with the simulator after each chapter.

The overall goal of this project has been to gain an advanced theoretical and practical understanding of reinforcement learning, through comparative studies and experiments with different methods and parameter settings. In this process I have found some aspects of reinforcement learning that to my knowledge are new contributions to the field. The motivation has been the potential for applications in making systems that can experience, advance themselves, and find creative solutions that not even humans had imagined.

1 The term complex is not mine, it comes from Psychology (Atkinson et al. (1996)).

Acknowledgements

I would like to thank Peter Johansen for his efforts to boost AI and robotics at DIKU - it has been an inspiration. Thanks to Preben Alstrøm for insightful talks about reinforcement learning. Also, I would like to thank the proof-reader Marie-Louise Svenn for her efforts, and my girlfriend Kristine Blond for her patience over the last two months. A special thanks to Rudy Negenborn. I have enjoyed the fruitful and philosophical discussions Rudy and I had about AI and learning.

Contents

1 Introduction
   Biological Learning meets Computational Learning
   Goal of the report
   Outline and contributions
   Project Limitations
2 Exploring the Fundamentals of Reinforcement Learning
   The basic RL framework and assumptions
   What is optimal behavior?
   The Markov Property - approximating environment dynamics
   Markov Decision Processes
   Learning the Optimal Policy when P_{ss'}(a) and R_{ss'}(a) are Known
   Policy Iteration
   Value Iteration
   Learning the Optimal Policy from scratch - P^a_{ss'} and R^a_{ss'} are Unknown
   Approximating P^a_{ss'} and R^a_{ss'} with the learning rate α
   Temporal Difference Learning
   The Monte Carlo method
   Temporal Difference vs. Monte Carlo
   How to improve the policy
   The basic control algorithms: SARSA and Q-Learning
   SARSA: On-policy TD(0) control algorithm
   Q-Learning: Off-policy TD(0) control algorithm
3 The Reinforcement Learning Simulator
   The Simulator: solving mazes and grid world problems
   Designing a RL system
   Representing the agent, the environment and the state
   The Reward Function and the discount factor γ - the solution guideline
   Comparative analysis of different reward function configurations
4 Exploration in Learning - How to act and efficiently accumulate experience
   Converging towards optimality through α annealing
   First test for Exploration Strategies
   Undirected Exploration
   ɛ-greedy Exploration
   Boltzmann-Gibbs Exploration
   Combination Strategy, Max-Boltzmann
   Directed Exploration - learning exploration heuristics
   Experiments with Directed Exploration
   Discussions
5 Offpolicy methods versus Onpolicy methods
   Why is this an important issue?
   The cliff-walking task
6 Eligibility traces - Who is to blame?
   n-step returns and TD(λ) learning
   Revising the credit assignment distribution function in TD(λ)
   Proposing an Alternative CAD Function
   Introducing Eligibility Traces, The Backward View of TD(λ)
   Proof of Equivalence
   TD(λ, θ) - Using Eligibility Traces
   Trace Based Control Algorithms
   The SARSA(λ) algorithm
   The Watkins Q(λ) algorithm
   The SARSA(λ, θ) algorithm and the Q(λ, θ) algorithm
   Experiments
   λ experiments
   SARSA(λ, θ) and Q(λ, θ): do they work?
   Replacing Traces
   Speeding up computation of trace based methods
7 Planning - Reflection as a way of learning
   The Dyna-Q algorithm
   The Prioritized Sweeping algorithm
   PS related to trace based methods
   Experiments
   The size of the threshold η and the number of reflections in PS
   Model Based Methods versus Model Free methods
8 Conclusions
   Contributions
   The discounting factor of the learning rate
   A grid world based reinforcement learning simulator
   Challenging exploration strategies with a specially designed test
   The Weakness of the Cliff Walking Task
   Temporal Credit Assignment
   Model-based Learning
   Perspectives on Reinforcement Learning and AI
A RL Basics
   A.1 Users manual

Chapter 1 Introduction

Learning saturates our lives. A vast amount of electro-chemical processes are continuously changing our mental and physical behavior through never-ending structural modifications of synapses in our nervous system. The observable consequences of these processes are called learning. [4][6][24]

1.1 Biological Learning meets Computational Learning

The general characteristic of a learning process is actually very simple. Learning may be defined as: a positive change in behavior in a given subject context resulting from experience.¹ "A given subject context" should be understood as the way in which we need to keep the subject matter constant if we want to observe learning. We also have to see a change in behavior that is positive in relation to the subject or problem at hand - an arbitrary change in behavior would not prove learning. What we are about to see is that the identified types of learning in human psychology have a computational counterpart, which fully adopts the idea of learning. Four different kinds of learning are identified by Atkinson et al.²:

Habituation: Learning to ignore a stimulus that has no significant consequence. From a computational perspective, habituation is tuning your function approximator. Habituation is finding out that you are generalizing in the wrong way. One is associating a stimulus with something it has no relation to. This could be the case in a process of classical conditioning as well as operant conditioning. Therefore I do not see this as a separate type of learning, but rather as a question of good or bad generalization.

Pavlovian Conditioning: Learning a causal relationship and associating one event with another. In Ivan Pavlov's classical experiment from 1927, a dog was taught to associate light with food, just like Jacobson, Fried & Horowitz (1967) taught primitive flatworms to associate light with mild electric shock. The association is learned when the conditioned stimulus gives the same response as the associated unconditioned stimulus. Other similar stimuli will now evoke the same response. Notice, however, that we do not learn a novel behavior; we just learn to react with a known response to a new stimulus.

1 The author's own definition.
2 Definition by Atkinson, Smith, Bem, and Nolen-Hoeksema [1]

This is called supervised learning or associative learning in AI and is used when working with function approximators such as artificial neural networks. Here the agent is presented with a training set of input/output data pairs, where the output is the wanted, correct response to the input. The learning process consists in modifying the synaptic weights of the input to each neuron, so that the error between the network output and the wanted response is reduced. In pattern recognition tasks with plenty of available training data, this has proved very successful.

Operant Conditioning: Operant conditioning amounts to learning that a particular behavior leads to attaining a particular goal.³ The agent operates on the environment, and the feedback from the environment conditions its behavior. Through interaction with the environment, the behavioral patterns that lead to the goal will be reinforced by the conditioning feedback from the environment. Operant conditioning is therefore also called reinforcement learning.

3 Rescorla, B.F. Skinner (1938), The Behavior of Organisms.

Figure 1.1: The Skinner Box.

B. F. Skinner's⁴ experiment is the most famous operant conditioning experiment, where a hungry pigeon or rat is placed in a bare box with a protruding bar which releases a food pellet if pressed. The rat moves around the box exploring, and at some point presses the bar, releasing a food pellet - this reinforces pressing the bar, and the pressing rate increases. If the food pellet magazine is disconnected from the bar, then pressing the bar is no longer reinforced and the pressing rate diminishes.

This unsupervised learning type is the focus of this report. In reinforcement learning the agent is not shown the target output; it is not given a solution. It is told what to do, but not how to do it. The agent learns a solution from its own experience, by taking actions in its environment and evaluating the feedback.

How reinforcement learning (RL) algorithms are implemented is the subject of the next chapter.⁵

5 There are other types of unsupervised learning besides RL in machine learning, such as Hebbian learning. See D.O. Hebb (1949), The Organization of Behavior.

Complex Learning: I don't like the term complex, since it does not tell us anything about what type of learning we are dealing with. Atkinson et al. (1996) use this label to refer to learning through operating on mental representations of the world, cognitive maps - a kind of reflection on one's experience. This type of learning is called model-based learning or planning in computational reinforcement learning, because that's all it is. We will get back to that later in chapter 7.

As you can see, there is a computational learning counterpart to all identified biological learning types. In other words, this report is not about a bio-inspired heuristic that can simulate a learning process; instead this project is about true learning and the problems one faces when trying to learn.

1.2 Goal of the report

This project aims first to give a thorough description of reinforcement learning principles. It has been my goal to describe reinforcement learning in a manner that makes the project understandable for readers not familiar with reinforcement learning. In addition, I also had a goal of making smaller contributions to the field and finding new aspects of reinforcement learning, hopefully of interest to people in the field.

1.3 Outline and contributions

Chapter 2 describes the theoretical basis of reinforcement learning. How do we define optimality? How do we approximate environment dynamics? How do we learn from experience? I compare policy iteration and value iteration, and experiment with the size of the Bellman residual to discover some interesting convergence properties.

Chapter 3 describes the grid world based RL problem simulator and discusses implementational issues of reinforcement learning, e.g. how we design the reward function. I experiment with different reward functions to observe their effect on convergence. What influence does the discount factor γ have on optimal behavior and convergence? I find an interesting discounting property of the learning rate α. The learning rate actually discounts future rewards in an initial phase.

Chapter 4 describes exploration strategies. I compare and experiment with three undirected exploration strategies and three directed strategies. I look into the issue of Exploration vs. Exploitation and identify the balance with each strategy through multi-goal maze problems, where the optimal solution demands exploration as well as exploitation. Based on the experiments I discuss the advantages and disadvantages of each strategy.

Chapter 5 describes the difference between On-Policy learning and Off-Policy learning. I implement a more aggressive version of Sutton and Barto's classical cliff walking task, and give a critique of the task. Based on the experiments I discuss the advantages and disadvantages of off-policy and on-policy learning.

Chapter 6 describes the temporal credit assignment problem and eligibility traces. The Markov property is revisited in the light of the delayed reward issue. The λ-return algorithm and its built-in credit assignment distribution (CAD) is discussed. Based on the identified problems with the CAD in the λ-return algorithm, I develop a new CAD based on a sigmoidal function. I show how the sigmoidal CAD can be used in a new trace based learning algorithm, TD(λ, θ). In addition, I experiment with the different trace based control algorithms and compare three different trace update rules.

Chapter 7 describes model based reinforcement learning. I describe the idea of planning in reinforcement learning and analyze two different planning methods: Peng and Sutton's Dyna-Q and Moore and Atkeson's Prioritized Sweeping. I experiment with the number of reflections per time step, and compare model based and model free methods.

Chapter 8 concludes this report and reviews what has been accomplished.

1.4 Project Limitations

Learning to solve real world problems demands the ability to generalize, since the world never looks exactly the same. This makes the environmental state space continuous and infinite, demanding that you implement a data structure that represents the state space in an indirect, generalized manner, such as a neural network. However, in this project the state spaces have a discrete, finite, and tabular representation. In other words, the agent will not be able to use its learned behavior in unexplored territories through means of generalization. I started reading about and implementing an artificial neural network in the beginning of the project phase. However, the idea behind the learning algorithms does not change, and therefore the use of function approximators has been dropped to keep the focus on learning.

Chapter 2 Exploring the Fundamentals of Reinforcement Learning

2.1 The basic RL framework and assumptions

Figure 2.1: The basic reinforcement learning process, a formalized model of the Skinner box.

Imagine a reinforcement learning agent - a robot - that tries to pick up a glass. The agent observes its environment at discrete time intervals; it takes an action a_t at time t and observes one time interval later, at time t+1, that the state of the environment has changed from s_t to s_{t+1} - this state transition might be that it tipped over the glass. The agent records the state transition and at the same time receives a scalar value r_t called a reward. The reward describes how good the state transition was in relation to the problem at hand. If the state transition does not clearly inform about how the robot is doing - i.e. if the agent is not yet touching the glass - the reward will probably be zero. This interaction process between agent and environment is described in figure 2.1. As you can see, the process is analogous to Skinner's operant conditioning experiment.
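To make the interaction loop of figure 2.1 concrete, the following sketch shows one way it could be written down. It is my own illustration rather than code from the simulator described later; the ToyEnvironment, its two actions, and the class and method names are all hypothetical.

    import random

    # Schematic agent-environment loop from figure 2.1: at each discrete time
    # step the agent selects an action, and the environment answers with the
    # next state and a scalar reward.
    class RandomAgent:
        """Placeholder agent that acts at random; a learning agent would also
        update its value function and policy in observe()."""
        def __init__(self, actions):
            self.actions = actions

        def act(self, state):
            return random.choice(self.actions)

        def observe(self, state, action, reward, next_state):
            pass  # learning would happen here

    class ToyEnvironment:
        """Hypothetical two-state 'glass' environment, only here to make the loop runnable."""
        def __init__(self):
            self.state = 0  # 0: not at the glass, 1: at the glass

        def step(self, action):
            reward = 1.0 if (self.state == 1 and action == "pick_up") else 0.0
            if action == "move":
                self.state = 1
            return self.state, reward

    env, agent = ToyEnvironment(), RandomAgent(["move", "pick_up"])
    state = env.state
    for t in range(10):  # ten discrete time steps
        action = agent.act(state)
        next_state, reward = env.step(action)   # the state transition and the reward r_t
        agent.observe(state, action, reward, next_state)
        state = next_state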

The goal of a reinforcement learning agent is always to maximize a measure of long-run accumulated reward. It will take the action it believes will lead to the maximum amount of reward, and this is not always the action with the immediate highest reward. Furthermore, there will often only be a few state transitions that without a doubt are 100% good. In our example, picking up the glass, which results in a positive reward, is the good transition. The agent has to determine how to trigger these transitions as often as possible.

As you can see, the described reinforcement learning system consists of a discrete set of environment states S, a discrete set of actions A that the agent has at its disposal, and a set of real or natural scalar rewards. These sets are assumed to be finite. The state and action space can be continuous, but discretizing them makes the implementation and the math more pleasant to work with.

Figure 2.2: Inside the reinforcement learning agent.

In figure 2.2 we take a closer look at the agent. First a state perception system perceives the state of the environment. The perceived state and reward are used to update a value function that gives each state a scalar value, which in turn is an estimate of how good it is to be in this state in relation to solving the task at hand. The value estimate tells us if the state is on a potential big-reward path. If so, the agent is on the wanted path that will maximize reward in the long run. The perceived state is then given to the policy π that selects the action to be executed. The policy updates itself using the value function. The arrow is not solid the other way around, because it is not necessarily the case that learning methods take the policy into account. We will return to the issue of off-policy vs. on-policy learning later in chapter 5.

First question: can the robot always perceive the state of the environment correctly? For real world robots, there will always be several situations where the environment state is only partially perceivable by the robot.¹ A special case of partial observability is when distinct environment states, that demand different actions, appear identical to the agent - this is called perceptual aliasing.

1 The mathematical framework for dealing with partial observability is called POMDP (Partially Observable Markov Decision Processes), and reinforcement learning methods can also be applied to these types of problems.

This project will devote its time not to perception but to learning, and therefore we simplify perception and assume fully observable states. The perceived state ŝ_t is thereby equal to the actual state s_t of the environment.

As to the environment, it will be assumed non-deterministic but stationary. Non-deterministic means that taking the same action a in the same state s does not necessarily result in the same state transition; the next state and reward can vary. Stationary means that this variation in state transitions and rewards follows a fixed probability distribution. The environment dynamics do not change over time. Now, don't be too disappointed, thinking: but what is the use? Is the real world not largely non-stationary? These methods still adapt in non-stationary environments, they just do not assume the dynamics to change.²

2 Some researchers are working on a novel mathematical framework called Hidden Mode Markov Decision Processes or HM-MDPs, which deals with smaller changes in environment dynamics.

2.2 What is optimal behavior?

We know the goal is to maximize the received reward in the long run, but before we can design learning algorithms, we have to define a formal model of optimality based on maximizing reward. The question is, when we stand at time t and want to take the action maximizing total reward, and if we could see the reward consequence of all future actions, what aspects of the infinite reward sequence Σ_{k=0}^{∞} r_{t+k} should we focus on? Because we cannot see into the future, we instead estimate it, and that is what the value function in figure 2.2 does, but we still have to decide what model of optimality we want to estimate. There are three major models of optimality:

The Finite Horizon Model
This is the simplest model. At a given moment in time t, we optimize the expected reward sum R_t received over the next n discrete time steps:

R_t = E( Σ_{k=0}^{n} r_{k+t} )

We do not look at rewards after time t + n, but take the action that is optimal within this finite horizon. This is not an optimal solution if the agent lives longer than n time steps. Imagine an episodic task where, n + 1 steps from the start state, there was the essential grand reward. Making a path to it would yield optimality. However, all paths to this reward start out by taking a specific action in the start state. A policy based on n-step estimates would not take the essential reward into account when selecting an action, and would therefore never converge to the real optimal policy. For episodic tasks where the number of discrete time steps used in an episode is known, the finite horizon model can be used.

The Infinite Horizon Discounted Model
If the agent's lifetime is undefined and the task is continuous, then we don't want to set a clear time limit on what rewards should be accounted for:

R_t = E( Σ_{k=0}^{∞} γ^k r_{k+t} )    (2.1)

We still need to bound the sum, otherwise we are estimating infinity. The future discount factor γ, where 0 ≤ γ < 1, bounds the sum, which has the consequence that we prefer initial rewards. This model is without a doubt the most used optimality model in reinforcement learning algorithms, and also the optimality model integrated in all of the learning algorithms I experiment with. The discount factor must not be underestimated in its influence on the optimal policy, and it must be set so that the agent's horizon is appropriately discounted in relation to the size of the task at hand.

The Average Reward Model
We can bound the sum in a different way, by taking the average:

R_t = lim_{n→∞} E( (1/n) Σ_{k=0}^{n} r_{k+t} )

With this model, we do not prefer initial rewards over long-run rewards - policies primarily based on initial rewards will therefore not distinguish themselves in estimated performance from other policies.

It is important to notice that the optimal policy actually will differ if the optimality model that the learning algorithm is estimating is changed. The continuous task in the example environment in figure 2.3 shows that the three optimality models give three different optimal policies.

Figure 2.3: An environment in which the optimal policies under the finite horizon model (h = 4), the infinite discounted horizon model (γ = 0.9), and the average reward model end up in three different places. The unlabelled arrows give a reward of zero. (From Kaelbling, Littman & Moore (1996), Reinforcement Learning: A Survey.)

One can easily take the infinite discounted horizon model for granted when working with reinforcement learning, because all learning algorithms that you see are based on it. Figure 2.3 shows, however, that there are alternatives.
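As a small illustration (mine, not the report's), the three optimality models can be compared on a concrete, hypothetical reward sequence; only the truncation or weighting of the sum differs.

    # The three optimality models applied to a finite sample of future rewards.
    def finite_horizon_return(rewards, n):
        """Expected reward over the next n steps only."""
        return sum(rewards[:n])

    def discounted_return(rewards, gamma):
        """Infinite horizon discounted model, truncated at the sample length."""
        return sum(gamma ** k * r for k, r in enumerate(rewards))

    def average_reward(rewards):
        """Average reward model, approximated over the sample."""
        return sum(rewards) / len(rewards)

    rewards = [0, 0, 2, 0, 0, 0, 0, 10]          # hypothetical reward sequence
    print(finite_horizon_return(rewards, 4))     # 2: the distant reward 10 is invisible
    print(discounted_return(rewards, 0.9))       # about 6.4: early rewards are preferred
    print(average_reward(rewards))               # 1.5: no preference for early rewards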

2.3 The Markov Property - approximating environment dynamics

As we have seen, the reinforcement learning policy makes decisions at each time step based solely on the current state of the environment, since π(s_t) = a_t. What information must the state signal hold to enable the agent to make a policy for choosing actions as a function of the current state alone? In several tasks in life, decision-making is based on a complex structure of past sensations/stimuli as well as current sensations. This is because the response of the environment dynamics depends on such a complex structure of information. If decision-making is to base itself on the current state signal alone, the state signal needs to hold all relevant information about past sensations as well as current ones. If this is the case, the state signal is said to hold the Markov property.

We can look at the agent-environment interaction as a stochastic process in time governed by the environment dynamics. This process is called a discrete first-order Markov process if for all s', r and all possible values of the past events s_t, a_t, r_t, ..., r_1, s_0, a_0, it holds that

P{ s_{t+1} = s', r_{t+1} | s_t, a_t, r_t, s_{t-1}, ..., r_1, s_0, a_0 } = P{ s_{t+1} = s', r_{t+1} | s_t, a_t }

This states that only the current state and the selected action give any information about the future behavior of the agent-environment interaction process. Knowledge of the history of the process does not add any new information. Because of the one-step dynamics, we can through iteration predict all future states and rewards based on the current state alone.

We now have the answer to the section question, what information must the state signal hold to make a policy for choosing actions as a function of the current state? The answer is: the state must be a Markov state. The Markov property of the state signal is the foundation of all the reinforcement learning methods introduced in this report. The model-based methods introduced later will also try to model an environment governed by one-step dynamics, no questions asked. These reinforcement learning agents can still act in a non-Markov world; they just base their actions (the policy) and predictions (the value function) exclusively on the state of the environment. If the dynamics of the environment become too unstable, also known as chaotic, where predictability becomes almost impossible because dependence becomes so sensitive that, as Edward Lorenz³ himself puts it, "the flap of a butterfly's wings in Brazil could set off a tornado in Texas", Markov based reinforcement learning methods would of course have big problems. However, not even humans can predict the outcome of these systems, so it is hardly a weakness.⁴

3 The chaos theory idea known as sensitive dependence on initial conditions was a circumstance discovered by Edward Lorenz (who is generally credited as the first experimenter in the area of chaos) in the early 1960s. He held a talk in 1972 called "Predictability: Does the Flap of a Butterfly's Wings in Brazil set off a Tornado in Texas?"
4 We can now state the assumption of stationary environment dynamics in the framework of Markov processes. The assumption means that we only concern ourselves with processes that have so-called time-homogeneous properties. Let {X(t), t ≥ 0} be a discrete Markov process. If the conditional probabilities P(X(s+t) = j | X(s) = i), for s, t ≥ 0, do not depend on s, the process is said to be time-homogeneous.
In other words, the transition probabilities only depend on which state the process is in and not on the time.

2.4 Markov Decision Processes

Markov decision processes (MDPs) are controllable Markov processes,⁵ where the solution to an MDP is a policy that maximizes behavior in relation to a specific reward structure. A reinforcement learning task is an MDP solved through interaction between the agent or controller and the one-step dynamical environment. An MDP is called finite if the state and action space are finite. For a finite MDP, we can describe the environment dynamics by, first, the probabilistic state transition function P,

P_{ss'}(a) = P( s_{t+1} = s' | s_t = s, a_t = a ),    (2.2)

where s, s' ∈ S and a ∈ A, which tells us the probability of reaching the next state s_{t+1} when taking action a_t in s_t. Second, the reward function R,

R_{ss'}(a) = E( r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' ),    (2.3)

tells us the expected reward value if a given state transition (s → s') is made with a given action a. If P and R are known, the agent will know the exact consequence of taking an action. It obviously helps to know the environment dynamics when you are trying to solve a problem through interaction with the environment. We will now look at how the agent finds the optimal solution when the environment dynamics are known.

5 Wiering (1999)

2.5 Learning the Optimal Policy when P_{ss'}(a) and R_{ss'}(a) are Known

Given the transition and reward function, the goal is to find the optimal policy π* that maps states to actions in a way that maximizes discounted future rewards. We use the value function V^π to evaluate the policy π's action choices and improve π. V^π(s) is a discounted estimate of how much reward we can expect to accumulate if we follow policy π from state s. We define V^π(s) as

V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^{∞} γ^k r_{k+t+1} | s_t = s }

Here we see the use of the infinite horizon model from equation 2.1. We can use Dynamic Programming (DP) to compute the optimal policy for infinite horizon MDPs, because MDPs satisfy the conditions of optimal substructure and overlapping subproblems. Let's take a closer look at the value function to reveal the existence of the optimal substructures:

V^π(s) = E_π{ Σ_{k=0}^{∞} γ^k r_{k+t+1} | s_t = s } = E_π{ r_{t+1} + γ Σ_{k=0}^{∞} γ^k r_{k+t+2} | s_t = s }
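The transition function P from equation (2.2) and the reward function R from (2.3) can be held in simple tables for a finite MDP. The sketch below is my own, with made-up dynamics, and is not the representation used in the simulator; with P and R in this tabular form, the expectation above can be expanded further, which is what follows.

    import numpy as np

    # P[s, a, s2] holds P_ss'(a); R[s, a, s2] holds R_ss'(a).
    n_states, n_actions = 3, 2
    P = np.zeros((n_states, n_actions, n_states))
    R = np.zeros((n_states, n_actions, n_states))

    # Hypothetical dynamics: action 0 tends to move towards state 2, action 1 stays put.
    P[0, 0, 1], P[0, 0, 0] = 0.9, 0.1
    P[1, 0, 2], P[1, 0, 1] = 0.9, 0.1
    P[2, 0, 2] = 1.0                      # state 2 is absorbing
    P[:, 1, :] = np.eye(n_states)         # action 1: stay where you are
    R[1, 0, 2] = 1.0                      # only the transition into state 2 is rewarded

    # each (s, a) pair defines a probability distribution over next states
    assert np.allclose(P.sum(axis=2), 1.0)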

Standing in state s and taking action π(s), we can give an exact description of the reward r_{t+1}, because we know P and R:

V^π(s) = Σ_{s'∈S} P_{ss'}(π(s)) [ R_{ss'}(π(s)) + γ E_π{ Σ_{k=0}^{∞} γ^k r_{k+t+2} | s_{t+1} = s' } ]

V^π(s) = Σ_{s'∈S} P_{ss'}(π(s)) [ R_{ss'}(π(s)) + γ V^π(s') ]    (2.4)

Equation 2.4 is called the Bellman Equation. If π were the optimal policy π*, then 2.4 would be referred to as the Bellman Optimality Equation. The recursive nature of the Bellman Equation is the foundation for updating the value function in place. All reinforcement learning methods, including Dynamic Programming, that base their values in part on the value of the next state are called bootstrapping methods. Now we need to compute the optimal value function and policy by gaining a lot of experience, so we can update our value estimates and afterwards use them to improve the policy. One algorithm to do this is called Policy Iteration.

Policy Iteration

Figure 2.4: Policy Iteration: iterative improvement of π and V.

As you can see in figure 2.4, for each iteration k we get experience selecting actions with π_k and updating V with V^{π_k}, whereafter the updated value function V^{π_k} is used to improve the policy.

Implementing the policy evaluation step
We implement the policy evaluation step by iteratively using the Bellman equation,

V^π_{k+1}(s) = E_π{ r_{t+1} + γ V^π_k(s_{t+1}) | s_t = s } = Σ_{s'∈S} P_{ss'}(π(s)) [ R_{ss'}(π(s)) + γ V^π_k(s') ],  ∀s ∈ S    (2.5)

Because we know P and R, we are able to make an update of V^π(s) taking all possible next states into account. Equation 2.5 is thus called a full backup of V^π(s). The question is: can we use the Bellman equation in this way, replacing V^π with V_k? The answer is yes, if V_k at some point equals V^π. V_k has been proved to converge to V^π for k → ∞. In the implementation we stop the iteration when the maximum difference between V^π_{k+1}(s) and V^π_k(s), for all s ∈ S, is smaller than a specified stability threshold ɛ, also called the Bellman residual. V^π is then bounded by

| V^π(s) - V(s) | ≤ 2γɛ / (1 - γ).    (2.6)

Therefore, the policy may not be optimal after several iterations because of the residual. I implement the policy evaluation algorithm by updating V in place, so that V_{k+1}(s') is used in updating V_{k+1}(s) instead of V_k(s'). This converges faster than when using two arrays and updating after a whole state space sweep.

Implementing the policy improvement step
We should change our policy's choice of action in a given state if it leads to higher total reward. To estimate the value of a specific choice of action in a given state, we introduce the action value function Q:

Q^π(s, a) = E_π{ r_{t+1} + γ V^π(s_{t+1}) | s_t = s, a_t = a }

If we now try to change our action choice to a ≠ π(s) and Q^π(s, a) > V^π(s), then we should update π(s) with the improved action choice a. This gives us a new policy π', where V^{π'}(s) ≥ V^π(s), ∀s ∈ S. This is called the policy improvement theorem.⁶ The policy improvement step is easy; we just consider all possible action values in each state and choose greedily,

π'(s) = arg max_a Q^π(s, a) = arg max_a Σ_{s'∈S} P_{ss'}(a) [ R_{ss'}(a) + γ V^π(s') ]

6 For proof see Sutton & Barto (1998), p. 95.

We have now obtained an improved policy π'. In policy iteration V* and π* are reached through the iterative process shown in figure 2.4. At some point in the process the value function and policy converge to an equilibrium state, where π_{k+1} = π_k, giving us the optimal policy according to the Bellman Optimality Equation,⁷ unless we use a large residual.

7 See Sutton & Barto (1998).

I experimented with the size of the Bellman residual, and even with an infinitely large residual, which always results in only one policy evaluation sweep, policy iteration converged to optimality or near-optimality. For the Bellman residual experiment I have used a discrete maze implemented as a 2x2 grid world with one goal state and randomly distributed obstacle states. The reward structure is goal = 1 and obstacle = -1. The discount factor γ is set to 0.99.
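The evaluation-improvement loop just described, including the Bellman residual stopping rule and the in-place updates, can be sketched as follows. This is my own illustration under the tabular P[s, a, s'] and R[s, a, s'] representation assumed above, not the code used for the experiment below.

    import numpy as np

    def policy_evaluation(P, R, policy, V, gamma, epsilon):
        """Iterate the full backup (2.5) in place until the Bellman residual < epsilon."""
        while True:
            residual = 0.0
            for s in range(P.shape[0]):
                a = policy[s]
                v_new = np.sum(P[s, a] * (R[s, a] + gamma * V))  # full backup over all s'
                residual = max(residual, abs(v_new - V[s]))
                V[s] = v_new                                     # in-place update
            if residual < epsilon:
                return V

    def policy_improvement(P, R, V, gamma):
        """Greedy improvement: pick the action maximizing the one-step lookahead value."""
        n_states, n_actions = P.shape[0], P.shape[1]
        policy = np.zeros(n_states, dtype=int)
        for s in range(n_states):
            q = [np.sum(P[s, a] * (R[s, a] + gamma * V)) for a in range(n_actions)]
            policy[s] = int(np.argmax(q))
        return policy

    def policy_iteration(P, R, gamma=0.99, epsilon=1e-4):
        """Alternate evaluation and improvement until the policy stops changing."""
        V = np.zeros(P.shape[0])
        policy = np.zeros(P.shape[0], dtype=int)
        while True:
            V = policy_evaluation(P, R, policy, V, gamma, epsilon)
            new_policy = policy_improvement(P, R, V, gamma)
            if np.array_equal(new_policy, policy):               # pi_{k+1} = pi_k
                return policy, V
            policy = new_policy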

Figure 2.5: Observing the policy iteration process when varying the Bellman residual (number of value updates and number of policy changes per policy iteration).

The agent ran for a fixed number of episodes, where the CPU time used in each episode included a policy evaluation and improvement step, solving the maze with the improved policy, and recording the total reward received. The agent does not try to solve the maze in the two initial episodes, because at this point the policies have sometimes not converged to a solution that will reach the goal and terminate.

Looking at the "Reward per episode" graph in figure 2.6, you can see that for the Bellman residual = 1, the policy has converged to near-optimality around the 23rd episode, but does not converge to the optimal policy before episode 32. Optimality is detected 10 episodes later, when the policy does not change from one iteration to the other, if the Bellman residual is very small. However, because of large residuals, we also need to see no changes > .1 in the values for 2 iterations. You can spot convergence to optimality where policy evaluation stops, marked with an arrow - this also results in a drastic drop in CPU time per episode, because the policy iteration stops at this point and the rest of the episodes are solved by the optimal policy without further updates.

The conclusion is that a fairly small residual is preferable, both when observing total CPU time and total reward per episode. Very small residuals spend a bit more time in the first few episodes, but on the other hand converge so fast that the initial bad episode runs, which large residuals suffer from, are avoided. In the long run it is unambiguously better and faster. Policy Iteration is slow per iteration for large state spaces, because the time complexity of the policy evaluation step is high, O(|S|^3) (Wiering (1996)). The good thing about policy iteration is that it takes surprisingly few evaluation-improvement iterations to converge.

Figure 2.6: The consequences for reward and CPU time when varying the Bellman residual under policy iteration.

Value Iteration

Another algorithm is called Value Iteration. It is faster per iteration, but requires more iterations to converge. Value iteration is an off-policy algorithm with only the policy evaluation step of policy iteration - here you don't evaluate a policy but take the action maximizing the state value:

V_{k+1}(s) = max_a E{ r_{t+1} + γ V_k(s_{t+1}) | s_t = s, a_t = a } = max_a Σ_{s'∈S} P_{ss'}(a) [ R_{ss'}(a) + γ V_k(s') ],  ∀s ∈ S    (2.7)

It can be shown that V_k converges to the optimal value function V*. Now a single policy improvement step is all that is needed to obtain the optimal policy. The value iteration process is stopped in practice, like policy evaluation, using a residual.

Figure 2.7 confirms what Wiering showed in 1999: that Value Iteration outperforms Policy Iteration. The policy evaluation iterations in Policy Iteration, which are based on initial, cyclic non-solution policies, simply waste time. Furthermore, in Value Iteration the maximizing action is found in place, just like the values are updated in place, making value iteration more efficient per iteration. As you can see in this similar maze setup, Value Iteration is, as expected, faster, and in this experiment it even has a higher total reward. I must admit that the graphs are based on one experiment only, merely to confirm Wiering's results. The better performance is first of all based on CPU time, since both methods can iterate until optimality.
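For comparison, here is a corresponding sketch of value iteration (equation 2.7), again assuming the tabular P and R arrays; it is written by me as an illustration only.

    import numpy as np

    def value_iteration(P, R, gamma=0.99, epsilon=1e-4):
        """Update V with the maximizing action in place; stop on a small residual."""
        n_states, n_actions = P.shape[0], P.shape[1]
        V = np.zeros(n_states)
        while True:
            residual = 0.0
            for s in range(n_states):
                v_new = max(np.sum(P[s, a] * (R[s, a] + gamma * V))
                            for a in range(n_actions))           # max over actions, eq. (2.7)
                residual = max(residual, abs(v_new - V[s]))
                V[s] = v_new
            if residual < epsilon:
                break
        # a single greedy improvement step then extracts the (near-)optimal policy
        policy = [int(np.argmax([np.sum(P[s, a] * (R[s, a] + gamma * V))
                                 for a in range(n_actions)]))
                  for s in range(n_states)]
        return policy, V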

Figure 2.7: The consequences for reward and CPU time for Policy Iteration versus Value Iteration.

Learning the Optimal Policy from scratch - P^a_{ss'} and R^a_{ss'} are Unknown

We are about to solve finite MDPs based on the infinite horizon optimization model, but this time without a world model. This situation is entirely different, and much more realistic. The agent does not know the consequences of its actions in any of the possible states of the world. It cannot predict, prior to acting, the transition probabilities or the immediate reward consequences. The agent cannot iteratively loop through states 1 to n in a DP-like manner, just thinking its way to the optimal policy; it needs to actually experience or observe a specific state transition. One of the questions we now need to address is: how does the agent get the necessary experience? We will look into this aspect of learning to solve real world problems later, and for now just assume that we are getting a qualitative stream of experience samples at discrete time steps from following a policy π.

Experiencing in policy iteration is the policy evaluation step. Here we did a full backup when updating values using P^a_{ss'} and R^a_{ss'}. With no world model, we rely on the samples of experience we get - updating with samples is called a sample backup. Can we still use the Bellman equation as in the policy evaluation step of Policy Iteration? Yes, we can. Learning is done through a revised general form of policy iteration and value iteration, which Sutton and Barto call Generalized Policy Iteration (GPI) - because it is only the general idea of iterating back and forth between Policy Evaluation and Policy Improvement that is the same. Some learning methods use GPI in a value iteration way - these are called off-policy methods; others are more closely associated with policy iteration and are called on-policy methods.

We will look into the differences between off-policy and on-policy methods in chapter 5. But how do we estimate V(s) when we don't know P^a_{ss'} and R^a_{ss'}? In a stochastic environment a single sample backup only tells us one consequence of a state/action pair (SAP) out of a whole distribution of possible transitions. We need to consider how to use the samples to approximate the transition distribution and reward distribution when updating state values.

Approximating P^a_{ss'} and R^a_{ss'} with the learning rate α

When updating V(s) we take into account that this is only a sample backup by way of the learning rate α. The learning rate is introduced to approximate the value of the real world transition probability distribution:

V(s_t) = (1 - α) V(s_t) + α R_t    (2.8)

As you can see, the α part of V(s_t) is replaced with the future reward estimate R_t, based on sample experience received after time t. Transitions from s to s' that appear more often, and therefore have a higher transition probability than others, will similarly more often replace an α part of V(s), thus reflecting their high probability. If the environment is deterministic, sample backups equal full backups, and the learning rate becomes meaningless.

One thing to notice about the learning rate is the fact that old samples are not valued as much as new ones. If we follow the development of V(s) over n time steps from time 0, we end up with

V_n(s) = (1 - α)^n V_0(s) + α Σ_{t=1}^{n} (1 - α)^{n-t} R_t.    (2.9)

The first term will go to zero for n → ∞. As you can observe, the last received sample R_n has a weight of α, the predecessor sample R_{n-1} has a weight of α(1 - α), and so forth, giving us a normalized weight distribution. The weight distribution decreases exponentially with time. For a stationary environment the learning rate should be rather small, as the samples do not decrease in value over time. One will get a better approximation of the transition probabilities with a smaller α. On the other hand, our idea of how the environment dynamics work is also more rigid with a smaller α value.
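A quick numerical check of equation (2.9), written by me with made-up samples, shows that repeated constant-α updates weight old samples exponentially less than recent ones:

    alpha, V0 = 0.1, 0.0
    samples = [1.0, 0.0, 1.0, 1.0, 0.0]      # hypothetical return samples R_1 .. R_n

    V = V0
    for R in samples:
        V = (1 - alpha) * V + alpha * R      # the sample backup of equation (2.8)

    # closed form of equation (2.9): old samples carry exponentially smaller weights
    n = len(samples)
    V_closed = (1 - alpha) ** n * V0 + sum(
        alpha * (1 - alpha) ** (n - t) * R for t, R in enumerate(samples, start=1))
    assert abs(V - V_closed) < 1e-12         # the two forms agree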

Temporal Difference Learning

Model-free methods that estimate R_t using other state estimates, just like DP, are called Temporal Difference (TD) learning methods. The simplest of these methods is TD(0),⁹ which, just as DP, is based on the recursiveness of the Bellman Equation and the assumption of the Markov property. Every iteration is a correction of the estimate, carried out at each discrete time step - but with a rougher approximation than under DP, as we only have sample experience as an indicator of the transition probabilities and reward distributions:

V^π(s_t) = (1 - α) V^π(s_t) + α ( r_{t+1} + γ V^π(s_{t+1}) )    (2.10)

Here we only update V^π(s_t) when receiving r_{t+1}, but what about V^π(s_{t-1})? r_{t+1} is also part of its future rewards - and the same can be said about V^π(s_{t-2}), although the uncertainty regarding the importance of r_{t+1} as part of the value estimate definitely grows. This issue is called the Credit Assignment Problem, and will be treated in the chapter about eligibility traces. For now we will focus on so-called one-step learning such as TD(0).

9 The name TD(0) will make more sense when we get to eligibility traces in chapter 6.

The Monte Carlo method

Temporal Difference methodology is one way sample experience can be used to learn, but by no means the only way. Monte Carlo methods are different in that they base the state value exclusively on experience, not on other state value estimates like TD methods do. A Monte Carlo agent does not base its value estimates on the assumption that the environment dynamics can be approximated by the Markov property, or on any other assumptions for that matter. The Monte Carlo agent is what one could call a real empiricist, only basing its conclusions on pure empirical results. Monte Carlo records all the rewards following a state,

V^π(s_t) = (1 - α) V^π(s_t) + α Σ_{k=0}^{n} r_{k+t}    (2.11)

where n is the time of the terminal state, before updating the state value. This of course only makes sense in episodic tasks; otherwise Monte Carlo would have to wait for an infinitely long time before updating. The obvious practical downside of Monte Carlo is that the agent does not use its experience right away - it walks around for a whole episode before getting smarter.

Temporal Difference vs. Monte Carlo

Let's assume that time was not an issue; we just want to use the experience we have gathered so far to give the best possible prediction of the future rewards following each state. In the example described in figure 2.8, the agent has so far gathered 4 samples of experience - sample 1 showed a transition episode from s_1 through s_2 terminating at s_5 with a total reward of zero, and the three other samples showed an episode starting in s_2 and terminating in s_4 with a total reward of 2. We will now ask the TD and Monte Carlo agents about the value of the states - they can iterate as much as they want to over the four samples. The value function is only updated once we have all the samples - this is called batch updating, because we wait until we have a batch of experience.

What should the value of state s_1 be? The Monte Carlo agent sets V(s_1) = 0, whereas the TD agent sets V(s_1) = 0.75(γ + γ²). Monte Carlo has only experienced a zero return after visiting state s_1, so the value is set to zero. TD on the other hand bases its value on the assumption that the environment holds the Markov property: state transitions from s_2 only depend on the action taken in s_2, not on past SAPs or rewards. TD therefore notices that the current transition probability for s_1 is, as of right now, 100% to s_2 with zero reward, and therefore V(s_1) = γ V(s_2).¹⁰

10 Example is similar to Sutton (1998).
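The figure 2.8 example can be worked through mechanically. The short script below is my own illustration: it computes the batch Monte Carlo values directly from the four episodes, and the batch TD(0) value of s_1 via the maximum-likelihood Markov model that the samples imply (which is what batch TD(0) converges to).

    gamma = 1.0   # the text keeps gamma symbolic; any value works here

    # the four sample episodes as (state, reward-received-on-leaving-it) sequences
    episodes = [
        [("s1", 0), ("s2", 0)],      # sample 1: s1 -> s2 -> s5, total reward 0
        [("s2", 1), ("s3", 1)],      # samples 2-4: s2 -> s3 -> s4, total reward 2
        [("s2", 1), ("s3", 1)],
        [("s2", 1), ("s3", 1)],
    ]

    # Monte Carlo: V(s) = average of the discounted returns observed after visiting s
    returns = {}
    for ep in episodes:
        for k, (s, _) in enumerate(ep):
            g = sum(gamma ** (i - k) * r for i, (_, r) in enumerate(ep[k:], start=k))
            returns.setdefault(s, []).append(g)
    mc = {s: sum(g) / len(g) for s, g in returns.items()}

    # Batch TD(0): under the sample-based Markov model, V(s2) = 0.75 * (1 + gamma),
    # and s1 always moved to s2 with zero reward.
    v_s2_td = 0.75 * (1 + gamma)
    v_s1_td = gamma * v_s2_td            # = 0.75 * (gamma + gamma^2)

    print("Monte Carlo V(s1) =", mc["s1"])   # 0.0
    print("Batch TD(0) V(s1) =", v_s1_td)    # 1.5 for gamma = 1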

Figure 2.8: Four samples of experience. What should we conclude about the environment dynamics? The blue states are start states, the yellow ones terminal states. (Sample 1: s_1 → s_2 → s_5 with rewards 0, 0. Samples 2, 3 and 4: s_2 → s_3 → s_4 with rewards 1, 1.)

How to improve the policy

So far we have looked at the part of GPI that concerns policy evaluation. The policy improvement step was really simple when we knew P^a_{ss'} and R^a_{ss'}, because we just looked at which action maximized Σ_{s'∈S} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]. Now we don't exactly know where an action is taking us! The state value function is thus a bad help when we want to improve the policy, and we therefore reintroduce the action value function Q. Rewards are instead associated with and recorded as a function of state and action. The state value function V(s) in the agent is replaced with Q(s, a), and the learning algorithms that come out of this substitution are called control algorithms, as they are now suited for updating a control policy. We can now improve the policy π greedily as before,

π'(s) = arg max_a E_π{ Q^π(s, a) }.

The difference is that in practice we use a very rough estimate of Q^π, as the policy evaluation step does not iterate until the Q-values are stable. The estimates are even rougher than in a policy evaluation step using an infinite Bellman residual, because that at least includes a whole state space sweep. When using sample experience instead of state sweeps, this greedy strategy does not guarantee convergence. Instead it has the consequence that we don't get the necessary experience, as the policy decides where we are going. We therefore need to take actions that, based on the current Q-value estimates, look non-greedy. The policy improvement step is where we decide how to explore the world, how to act. The policy improvement strategy is therefore referred to as the exploration strategy, and it is just as important as the learning algorithms themselves. We will look at exploration strategies in chapters 4 and 5.

The basic control algorithms: SARSA and Q-Learning

I will here shortly introduce the two basic control algorithms, SARSA and Q-Learning, so that we are equipped to go on and solve control problems with reinforcement learning.

SARSA: On-policy TD(0) control algorithm

When we transform the TD(0) learning algorithm directly into a control algorithm, by substituting V^π with Q^π, we get a control algorithm called SARSA:

Q^π(s_t, a_t) = (1 - α) Q^π(s_t, a_t) + α ( r_{t+1} + γ Q^π(s_{t+1}, π(s_{t+1})) )    (2.12)

The algorithm is called SARSA because we need a State s_t, an Action a_t, a Reward r_{t+1}, a State s_{t+1}, and an Action π(s_{t+1}) before we can make an update. SARSA estimates Q^π and converges towards Q* when π converges towards π*, just like policy iteration.

Q-Learning: Off-policy TD(0) control algorithm

Off-policy algorithms estimate Q* instead of Q^π, and therefore remind us of the value iteration algorithm:

Q*(s_t, a_t) = (1 - α) Q*(s_t, a_t) + α ( r_{t+1} + γ max_{a_{t+1}} Q*(s_{t+1}, a_{t+1}) )    (2.13)

In this way we experience the consequences of policy π, but learn π*. Now we are ready to experiment and advance our understanding of learning to solve real world problems.
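As a compact reference before moving on, the two updates (2.12) and (2.13) can be written as one-liners over a tabular Q. This is a minimal sketch of my own, not the simulator's implementation, and the default α and γ are just illustrative values.

    import numpy as np

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.2, gamma=0.99):
        """On-policy: bootstrap on the action the policy actually takes next (eq. 2.12)."""
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s_next, a_next])

    def q_learning_update(Q, s, a, r, s_next, alpha=0.2, gamma=0.99):
        """Off-policy: bootstrap on the greedy action, whatever the behavior policy did (eq. 2.13)."""
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * np.max(Q[s_next]))

    # hypothetical usage on a 5-state, 4-action table
    Q = np.zeros((5, 4))
    sarsa_update(Q, s=0, a=1, r=-1.0, s_next=2, a_next=3)
    q_learning_update(Q, s=0, a=1, r=-1.0, s_next=2)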

Chapter 3 The Reinforcement Learning Simulator

3.1 The Simulator: solving mazes and grid world problems

You can come a long way with grid worlds! You know the optimal solution and you have a chance of understanding the behavioral details of the agent.

Figure 3.1: This example shows the three DP agents from the Bellman residual experiment demonstrating what they think is optimal behavior. As you can see, they all reach an optimal solution. (The grid shows obstacle states, the start state, the goal state, and the maze grid width and height numbering.)

The grid world maze problem is an episodic task that stops when the agent has found the goal. The agent has in this example four actions to choose from: North, South, East, and West. The maze is surrounded by a wall, so if the agent tries to walk out of the maze, it will bump into the wall and end up in the current state - the same goes for obstacle states. There are two additional state types - one terminal state, ending the episode, and one non-terminal state.

These are used to make a sub-optimal goal state, and a penetrable obstacle state. The grid world mazes I have implemented come in a variety of configurations. One can scale the problem in different ways. The first scaling parameter is the size of the maze: height and width can vary, but this does not limit the maze configuration to squares, because one can build walls on the square base.

Second, one can add a topological dimension, so that the agent also has to take the landscape gradient into account. As of right now, this does not advance the problem in the way that it was intended. The idea of adding a topological dimension could be to experiment with hierarchical reinforcement learning, actually solving two problems at the same time. However, I found that in order to do this, I would have to implement a physics engine simulating a car driving around in a topological landscape and a neural network to deal with the vast amount of states. This would have changed the perspective of my work away from learning. Instead the topological dimension is as of now implemented so that the reward is negative if the agent takes a discrete step uphill and positive if the agent starts to go downhill.

Figure 3.2: This example is a SARSA and a Q-Learning agent solving the maze with the topological dimension turned on.

The gradient is based on a local perspective that changes with the direction of the agent; the local gradient g is based on the difference in the z-coordinate from state s to s': g(s, s') = tan⁻¹(s'_z - s_z). This gives a lot of reward information at each step, and some would say too much. Richard Sutton once wrote that you should tell the agent what you want, not how you want it. However, the topological dimension can still give a different perspective on the reward function, in a way that standard 4-connectivity mazes cannot.

The third scaling parameter is the number of actions: the agent can have four or eight actions at its disposal, increasing the number of SAPs by a factor of two. Eight-connectivity is always used when the topology is turned on.

The last configuration parameter is P^a_{ss'}. I have limited the configuration of P so that for state s, P^a_{ss'} = 0 for all non-neighbor states s'. A transition distribution like [2, 10, 76, 10, 2] for a 4-connectivity agent means that there is a 76% chance of performing the chosen action, a 10% chance of tilting right or left of the chosen action, and a 4% chance of stepping in the opposite direction of the chosen action. In this world it is thus not possible to make a jump two states long. See appendix A for a description of the graphical user interface of RL Basics.
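The following sketch shows one way such a transition distribution could be sampled; it is my own illustration (the action ordering and function name are made up), not the simulator's code.

    import random

    ACTIONS = ["North", "East", "South", "West"]     # hypothetical clockwise ordering

    def sample_executed_action(chosen, distribution=(2, 10, 76, 10, 2)):
        """Perturb the chosen action: offsets of -2..+2 quarter turns are drawn with the
        given weights, so 76% straight, 10% per side tilt, and 2% + 2% = 4% opposite."""
        offsets = [-2, -1, 0, 1, 2]
        offset = random.choices(offsets, weights=distribution)[0]
        i = ACTIONS.index(chosen)
        return ACTIONS[(i + offset) % len(ACTIONS)]

    print(sample_executed_action("North"))           # usually "North", sometimes a neighbor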

3.2 Designing a RL system

Representing the agent, the environment and the state

When implementing a reinforcement learning system, we first have to define what we mean by environment and agent. The agent boundary stops at the point of decision-making and learning. The agent receives a state signal and a reward from a system that must be defined as part of the environment. In robots, a sensor-processing system usually outputs a state signal and sends it to the reward function and the RL agent. In our case, sensor calibration and processing is trivial, but for real world robots, this is where you spend your time!

The state is where you can get really creative with RL. How should you represent the world to the agent? You want to limit the state space as much as possible without preventing the agent from learning to solve the problem. The state representation is only limited by the available environment sensors. In our case the state is the agent's position in the grid world.

The Reward Function and the discount factor γ - the solution guideline

Figure 3.3: This is a 4-connectivity 10x10 maze problem with a start and a goal state. There is obviously a long way and a short way through the maze. But will the agent prefer the short one? We look at two simple reward functions.

The designer's knowledge of the problem solution is the basis of the reward function R^a_{ss'}. The question is how much information we should put into the reward function. Our RL agent is very well raised in that we can count on it following the guidelines we have set out - it will maximize reward in relation to our specified reward function, and will value closer rewards higher than distant future rewards if we discount the future.

So we really have to be sure about what kind of behavior we reinforce. For an episodic task with a defined goal state, like the maze problem, the simplest reward function only rewards a transition to the goal; all other transitions give zero reward. If we want the agent to find the shortest path to the goal with this reward function, we need to discount future rewards, meaning γ < 1. Otherwise all paths leading to the goal at some point will be just as good as the shortest path.

Figure 3.4: Total reward and accumulated steps for two Q-Learning agents with α = 0.5 and ɛ-greedy exploration factor ɛ = 0.1; the discount factor is respectively γ = 1 and γ = 0.9. The maze is the 10x10 maze shown in figure 3.3.

In the experiment in figure 3.4 it does not seem like γ = 1 is a problem, with convergence happening even faster than if we discount future rewards! Let us take a look at the propagation of the goal reward. The explanation for the convergence is that the learning rate α seems to discount future rewards.

Figure 3.5: Only the transition to the goal state gives the reward r.

If we let the agent run through these three states for infinity, we see some interesting changes in the values of the three states. The way the reward propagates through the states is depicted in the following table.

No.       | V(s_1)                                  | V(s_2)                         | V(s_3)
1         | αr                                      | 0                              | 0
2         | (1-α)αr + αr                            | α²r                            | 0
3         | (1-α)²αr + (1-α)αr + αr                 | 2(1-α)α²r + α²r                | α³r
4         | (1-α)³αr + (1-α)²αr + (1-α)αr + αr      | 3(1-α)²α²r + 2(1-α)α²r + α²r   | 3(1-α)α³r + α³r
...       | αr Σ_{n=1}^{∞} (1-α)^{n-1}              | α²r Σ_{n=1}^{∞} n(1-α)^{n-1}   | α³r Σ_{n=1}^{∞} ½n(n+1)(1-α)^{n-1}
lim n→∞   | αr (1/α) = r                            | α²r (1/α²) = r                 | α³r (1/α³) = r

What we see is that the α factor in the state value estimate is an approximation error that works as a discount factor until it converges towards zero. At this point there are only the actual non-discounted future rewards left, since γ = 1. In other words, in the very long run we will run into convergence problems with γ = 1, unless our reward structure secures convergence in another way. I let the two agents from figure 3.4 keep running to see if the γ = 1 agent ran into problems when all SAP values started to look the same, and the convergence problems showed up around episode 7.

Figure 3.6: As you can see, I have zoomed in on the point where the state values for the γ = 1 agent start to look the same, making the long path as good as the short one!

A practical problem of such a simplistic reward function is that with very large state spaces, the agent seldom runs into a reward. In the initial phase the agent has absolutely no idea about what to do - it will simply act at random. This can make an RL agent very ineffective, because the initial convergence phase is too long, or in the worst practical case, an RL robot's battery runs dry before it has received a single reinforcement signal.
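Returning to the propagation table above, a tiny simulation (my own, assuming the agent repeatedly walks s_3 → s_2 → s_1 → goal and values are updated in visiting order) reproduces the pattern: with γ = 1 the remaining α-dependent gap behaves like a discount, and all three values eventually converge to r.

    alpha, gamma, r = 0.2, 1.0, 1.0
    V = {"s3": 0.0, "s2": 0.0, "s1": 0.0, "goal": 0.0}

    for episode in range(1, 201):
        # one pass through the chain; only the final transition is rewarded
        V["s3"] = (1 - alpha) * V["s3"] + alpha * (0 + gamma * V["s2"])
        V["s2"] = (1 - alpha) * V["s2"] + alpha * (0 + gamma * V["s1"])
        V["s1"] = (1 - alpha) * V["s1"] + alpha * (r + gamma * V["goal"])
        if episode in (1, 2, 3, 4, 50, 200):
            print(episode, {s: round(v, 3) for s, v in V.items()})
    # after pass 1: V(s1) = alpha*r; after pass 2: V(s2) = alpha^2*r; and so on,
    # exactly as in the table; after many passes all three values approach r.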

Another problem is that for one-step RL methods like SARSA and Q-Learning, it will take at least as many episodes as the number of steps in the optimal solution for the goal reward to propagate back to the start state, before the agent is able to use its experience about the goal from the beginning of an episode. This can be solved by extending the one-step methods so that several states are credited for reaching the goal - maybe even the start state. We will return to these so-called eligibility trace based methods in chapter 6.

Comparative analysis of different reward function configurations

In general we can speed up convergence by punishing taking a step. The agent receives more valuable experience because all actions now have a non-zero reward value. The punishment will make the agent explore, because unexplored SAPs will be zero whereas explored SAPs will be mostly negative. In addition, it will not walk around blindly until it coincidentally bumps into the goal. This should lower initial costs. We could also punish even harder actions such as bumping one's head into the wall or into an obstacle. This second punishment should inform the agent about the environment dynamics, making convergence faster.

I tested the three reward functions on a 15x15 maze using Q-Learning. The performance of each reward function was averaged over 2 runs. Each run was 2 episodes.

Figure 3.7: The results of five Q-learning agents with different reward functions and discount factors solving a stochastic 15x15 maze.

In figure 3.7 one sees the speed of convergence for the five different configurations of reward function and discount factor. All five agents use one-step Q-Learning and ɛ-greedy exploration with ɛ = 0.1.¹

1 ɛ-greedy exploration is when you act randomly with a probability of ɛ and otherwise act greedily. It will be introduced properly in the next chapter.

greedy exploration with ε = 0.1.¹ The agent's objective is to find the shortest path through the maze. The question is how to design our reward function to meet this objective. Here are the different configurations of the reward function and discount factor:

Red: γ = 0.95, Goal = 1, Step = 0, Obstacle = 0, Wall = 0.
Green: γ = 0.95, Goal = 1, Step = -1, Obstacle = -1, Wall = -1.
Blue: γ = 1, Goal = 1, Step = -1, Obstacle = -1, Wall = -1.
Cyan: γ = 0.95, Goal = 1, Step = -1, Obstacle = -1, Wall = -1.
Magenta: γ = 1, Goal = 1, Step = -1, Obstacle = -1, Wall = -1.

There are several things to observe in the graph. First, the graph indicates that the differentiation between obstacles and steps does not work. Bumping into a wall or an obstacle is no worse than taking a step in the wrong direction, if the task is to find the shortest path in steps to the goal; it simply confuses more than it informs. Second, we can observe that discounting the future in a finite state space slows convergence a bit, since it slightly distorts the value distribution in the maze. The discount factor is, however, necessary for the red binary reward function, as we observed in the last section. Third, punishing a step efficiently speeds up initial convergence, but it also makes suboptimal holes appear more easily in the Q-value table. This makes final convergence to the optimal Q-function slower than with the binary reward function. In CPU time this means that the binary reward function ends up spending the same amount of time as the green reward function, despite its initial problems. This is the first hint that we should keep the reward function as simple as possible: tell the agent what to do, not how to do it.

The initial convergence problems of the binary reward function grow exponentially with the problem size, and the standard reward function I have used is therefore the green reward function, with γ = 0.99 to limit the size of the Q-values. Standard settings will from here on refer to α = 0.2 and the green reward function with γ = 0.99. One should be aware of the size of γ, since small γ values can eat up distant goal rewards before they have time to propagate back to the start state.

The problem of slow initial convergence is not just a problem for the red reward function, but a general problem in reinforcement learning. It can be alleviated by efficient exploration strategies, which we will look into in the next chapter.

1 ε-greedy exploration acts randomly with probability ε and greedily otherwise. It will be introduced properly in the next chapter.
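Before moving on to exploration, here is a minimal sketch of how the reward configurations listed above could be encoded. The function name and cell labels are my own assumptions, not the simulator's actual interface.

```python
# Hypothetical encoding of the "green" configuration above:
# goal = 1, every step = -1, obstacle and wall bumps = -1.
def green_reward(next_cell):
    if next_cell == "goal":
        return 1.0
    # steps, obstacle bumps and wall bumps are punished equally
    return -1.0

# The "red" (binary) configuration would instead return 1.0 for the goal and
# 0.0 otherwise, relying on gamma < 1 to prefer short paths.
```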

Chapter 4

Exploration in Learning - How to act and efficiently accumulate experience

An agent will never fully know the environment dynamics prior to acting when solving real world problems. It therefore needs to explore the environment to gather a representative collection of experience, which it can use to approximate the dynamics. When trying to devise a strategy for efficient exploration, we run into a central theme of learning: the dilemma of exploration versus exploitation.

We must remember that the policy has two objectives. The first is exploration, which has the purpose of minimizing learning time.¹ We build an exploration strategy into the policy so that we get the experience needed to reach optimality. Optimality is definitely not always the primary goal; in general we just need to know that as long as we keep on experiencing we will get better, until we ultimately reach optimality. The second objective is reward maximization and cost minimization: we use (exploit) our knowledge to select actions that minimize costs and maximize total reward. The problem is that exploration is costly and that exploitation alone stops the learning process. However, Sebastian Thrun showed in 1992 that efficient exploitation and efficient exploration are interdependent: it pays off to exploit the knowledge one has gathered while exploring, so that you don't explore irrelevant parts of the environment - and at the same time, exploitation will never be cost-minimizing without exploration.²

Depending on how often we update the policy, we have to be careful not to be too explorative, remembering that we could be controlling a real world robot that has to survive and solve the problem at hand in limited time. One of the aspects we will test in this chapter is the optimal degree of explorative behavior.

4.1 Converging towards optimality through α annealing

We don't want to keep on exploring forever, as the agent's behavior should converge. Some researchers in reinforcement learning anneal the learning rate to secure convergence. Marco Wiering (1999) has a local annealing procedure for the learning rate, so that the decrease in

1 Thrun 1992. A problem we all face on an everyday basis, although it seems like most people stop learning at some point, losing their imagination and their desire to explore new ways, taking an exploitative, greedy approach to life.
2 Thrun 1992.

the learning rate depends on the frequency of the current SAP - in other words, on how unexplored the current territory is:

    α(t) = 1/t^β,   1/2 < β ≤ 1,   t = freq(s, a).   (Wiering 1999)

This works for stationary environments like ours, since we approximate the dynamics behind a single SAP better - at some point - the more we visit it. For non-stationary environments, the assumption of convergence as a function of the number of SAP visits would decay with time. Instead we would have to anneal α as a function of an indicator like δQ: the size of δQ tells us whether we had the right ideas about how the world works.

α-annealing can however be troublesome. When using the QLearning control algorithm for an episodic task, the states around the start state will be visited so frequently that when the goal reward finally arrives it could have little influence on the Q-values, because the learning rate is already close to zero. Wiering has shown that convergence takes place when using α-annealing, but it also takes place without annealing, and in many practical applications the learning rate is kept constant for the same reasons. The last chapter showed that a constant learning rate does no harm to the convergence of the Q-values. There does not appear to be any reason for annealing the learning rate; on the contrary, it can cause problems. In addition, why should knowledge received later have less influence?

4.2 First test for Exploration Strategies

Exploration strategies can be split into two categories: strategies that explore the environment based on a randomizer, called undirected exploration, and strategies that try to maximize an exploration-specific value function based on a predicted knowledge gain. In the latter case, instead of learning to solve a control problem, the agent learns to explore efficiently by maximizing knowledge instead of total reward - this is called directed exploration.

We will first test the different exploration strategies and find the optimal balance between exploration and exploitation for each strategy. Secondly we will compare the methods. And at last we will discuss alternative exploration solutions to the initial convergence problem. The 15x15 maze with 8-connectivity in figure 4.1 was used with two different reward functions: the standard reward function and a binary reward function where the transition to the goal gives a reward of 1 and all other transitions give zero reward. The transition probabilities were [2, 5, 86, 5, 2]. The reason for the modest maze size is uncertainty - I had to repeat the experiment, which ran for many episodes and many repetitions for each parameter value, several times before I was able to see the tendencies that I was looking for. That takes time!

4.3 Undirected Exploration

Undirected exploration is, as mentioned, random based exploration. Again we must remember to explore as well as exploit. We can separate the two objectives or follow both at the same time.

Figure 4.1: The first maze used for the exploration tests.

The first undirected strategy separates exploration and exploitation, the second integrates them.

4.3.1 ε-greedy Exploration

The simplest explorative policy is called ε-greedy. It explores at random, based on a uniform probability distribution, with probability ε, and exploits according to

    arg max_{a ∈ A(s)} E_π{Q^π(s, a)}

with probability 1 − ε. We now take a look at the size of ε to pinpoint the optimal degree of exploration versus exploitation. I have solved the maze with 11 different sizes of ε. The optimal size of ε will vary with the problem size, but the tendency is clear. As you can see, ε = 0.5 gives the best result, both in the number of steps the agent used to solve the maze and in CPU time spent. In general, values between 0.5 and 0.1 seem to give nice results with little variance. Smaller values below 0.1 are very unpredictable, because they are more vulnerable to suboptimal solutions, sticking to them for too long, as one can observe in figure 4.2.

At some point the Q-values have stabilized around optimality and we don't want to continue exploring; in order to secure convergence we should thus anneal ε.
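As a concrete illustration, here is a minimal sketch of ε-greedy action selection (my own illustrative Python, not the simulator's implementation), assuming Q_s is a dictionary mapping the actions available in the current state to their Q-value estimates.

```python
import random

def epsilon_greedy(Q_s, epsilon=0.1):
    # explore: pick uniformly among all actions with probability epsilon
    if random.random() < epsilon:
        return random.choice(list(Q_s))
    # exploit: pick a greedy action, breaking ties randomly
    best = max(Q_s.values())
    return random.choice([a for a, q in Q_s.items() if q == best])
```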

Figure 4.2: The maze is solved using the standard reward structure. The convergence speed (accumulated steps) and the costs of learning (accumulated CPU time) vary with ε.

Figure 4.3: Total reward and total CPU time as a function of ε - one can observe that exploration and exploitation have to be balanced in order to optimize behavior.

Figure 4.4: The individual runs using ε = 0.1 (accumulated steps and accumulated CPU time); as you can see, the variance is large.

I have not tested different annealing functions, since the ε-greedy strategy is of no particular interest, due to its limitations. One should notice that even though one does not anneal ε, the degree of exploration per episode will still decrease over time, because the agent solves the maze with fewer and fewer steps.

4.3.2 Boltzmann-Gibbs Exploration

The problem with ε-greedy exploration is that no matter what we know, the exploration probability distribution is uniform. The Boltzmann-Gibbs strategy deals with this weakness. It uses a Boltzmann distribution instead of a uniform distribution and thus integrates exploration and exploitation, since the Boltzmann probability distribution makes the action selection at all times: there will be a lot of exploration when the Q-values are similar, and less when the Q-values differ. Each action a has a probability P(a|s) of being chosen:

    P(a|s) = e^{Q(s,a)/T} / Σ_{i ∈ A(s)} e^{Q(s,i)/T},   T > 0.

The temperature variable T influences the degree of exploration, because a large T will even out the action probability distribution. Boltzmann exploration is very sensitive to the reward function. Because of the exponential function, if all Q-values are negative there will always be a high degree of exploration, independent of T (since T is positive) and to some degree independent of the size of the differences in Q-values. If we know that the Q-values will be negative we cannot just make T negative, since large negative Q-values would then be preferred, but we could try to make T small. This looks like a very bad idea if you look at the graph in figure 4.5.

Figure 4.5: exp(Q(s,a)/T) as a function of Q(s,a) for different temperatures, with the typical Q-value intervals of the standard and the binary reward functions marked. The temperature variable T defines the probability distribution in a given Q-value interval; you should thus have an idea of the location of the Q-value interval before you define T. One can observe that for T ≤ 1, exploration practically stops outside the interval [−1, 1].

Outside that interval the probability distribution will be the extreme opposite of uniform, especially with Q-values > 1.
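A minimal sketch of this action-selection rule (illustrative Python, not the simulator's code):

```python
import math, random

def boltzmann(Q_s, T=1.0):
    # Subtracting max(Q) is a standard numerical safeguard: it does not change
    # P(a|s), but it keeps exp() from overflowing for large Q-value intervals.
    m = max(Q_s.values())
    weights = {a: math.exp((q - m) / T) for a, q in Q_s.items()}
    actions = list(weights)
    return random.choices(actions, weights=[weights[a] for a in actions])[0]
```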

Such an extreme distribution makes the agent vulnerable to suboptimal solutions, and even to suboptimal state space holes, which can make the agent loop for a long time.

I made several experiments with Boltzmann-Gibbs exploration. First I tried to solve the maze in figure 4.1 with the standard reward function. This was troublesome because the Q-value interval was too big: the agents were doing almost 100% uniform exploration, independent of the T setting, for all the very negative Q-values, and for T < 5 the higher Q-values resulted in suboptimal loops. We can observe both Boltzmann-Gibbs problems in figure 4.6.

Figure 4.6: Total reward and CPU time for different T values. At T = 3 the large probability differences were struck by a suboptimal loop hole.

When all Q-value estimates start to converge, a state will have eight SAPs whose Q-values lie within a shrinking interval. That will increase exploration, because the Boltzmann agent cannot focus on the optimal Q-value. For very large state spaces, where neighboring Q-values will be very close, Boltzmann agents will have problems focusing on the optimal action. This could result in agents starting to zig-zag around the optimal path.

When the reward function was changed from the standard to the simple binary version, Boltzmann suddenly proved more effective, as it has a hard time with very negative Q-values. Boltzmann exploration reminded us about a very important rule: tell the agent what to do, NOT how to do it! By using the binary reward function together with the discount factor, we can still make the agent find the fastest way through the maze, even though initial convergence is slower. A lesson to remember, in relation to using Boltzmann-Gibbs, is therefore "keep it small and positive".

However, I still had problems. It seemed like T simply couldn't get small enough! This could not be true, as that would mean no exploration. Why did exploration not pay off? The problem with the maze in figure 4.1 together with the binary reward function is that a greedy strategy reaches the optimal policy. Since exploration is more costly than exploitation, it does not pay off. A greedy strategy actually results in random exploration until the

agent has found the goal. This happens because greedy action selection is a uniform probability distribution over all greedy actions - and in the beginning all actions are greedy. There are no suboptimal holes in the state space, so nothing can go wrong. The second problem I set up was therefore harder.

Figure 4.7: The hidden goal task: a grid world with a solid outer wall, solid obstacles, a penetrative obstacle, a start state, a high reward goal and two low reward goals. The two paths in black are the optimal solution and the best solution without the high reward goal. The maze has three goals, where the optimal goal is hidden.

Now we have two terminal goal types and two obstacle types: two easily accessible suboptimal goals giving a low reward (zero), and one optimal high reward (1) goal hidden behind a new type of obstacle, which is painful (-1) but penetrative. A step gives -1 in reward. It pays off to go for the high reward goal, but you will never find it with a greedy approach.

The results from the hidden goal task, shown in figure 4.8, illustrate the dilemma of the exploration/exploitation trade-off. The greedier agents never find the high reward goal; the very explorative agents find it but do not use the knowledge to optimize their behavior. The balanced agents find it and use their knowledge to maximize reward: exploration and exploitation.

Figure 4.8: Accumulated reward and accumulated CPU time for different T values on the hidden goal task. The exploration/exploitation trade-off stands out: the agent needs to do both.

4.3.3 Combination Strategy, Max-Boltzmann

We can combine the two exploration strategies by exploring with probability ε using the Boltzmann distribution instead of a uniform distribution, and otherwise acting greedily. This separates exploration and exploitation to a degree that gives us the ability to always focus on the best action, like ε-greedy, while exploring qualitatively, like Boltzmann-Gibbs. The result of the comparative experiment can be seen in figure 4.9; the combination strategy proved successful. Like ε-greedy, the combination strategy was able to focus on the best choice in the beginning, when all Q-values were negative, whereas Boltzmann had, as expected, problems with negative Q-values. When the Q-values were starting to converge, the combination strategy was, like Boltzmann, dynamic and able to shift focus to exploitation, and ε-greedy was therefore left behind.

Figure 4.9: Total reward and CPU time for ε-greedy, pure Boltzmann and combination agents with various parameter settings (e.g. ε = 0.5; T = 5, 8, 10; and ε = 0.6 combined with T = 5, 10, 15). The figure shows very clearly how the combination strategy is able to take the best from the two strategies. It does not suffer from the initial high costs of Boltzmann, or from the convergence problems of ε-greedy. But as one can observe, it also drags along one of the downsides of Boltzmann-Gibbs: suboptimal holes when using small T values.

The combination of ε = 0.6 and T = 15 performed best, measured in total reward as well as CPU time. From here on I will refer to the combination strategy as Max-Boltzmann.
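A minimal sketch of the combination rule (illustrative Python, reusing the boltzmann helper sketched in the previous subsection; the parameter values mirror the settings used above and are otherwise assumptions):

```python
import random

def max_boltzmann(Q_s, epsilon=0.6, T=15.0):
    # with probability epsilon, explore qualitatively via the Boltzmann distribution
    if random.random() < epsilon:
        return boltzmann(Q_s, T)
    # otherwise act greedily, breaking ties randomly
    best = max(Q_s.values())
    return random.choice([a for a, q in Q_s.items() if q == best])
```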

4.3.4 Directed Exploration - learning exploration heuristics

Why not learn to explore? If we don't take the cost-minimizing action, we should take the action that teaches us the most. This is what directed exploration is about. One uses reinforcement learning methods to learn to explore in a way that optimizes the knowledge gained per action executed. In an unknown environment one cannot know, prior to acting, where to gain the optimal knowledge. However, we can build a heuristic exploration reward model R_Exp that materializes an assumption about where to look for the knowledge that best details the environment dynamics and rewards.

When we are trying to learn a control task, we must learn to estimate the transition probabilities and the reward function. In exploration learning we only need to learn the exploration reward model. This is because the exploration reward is not related to the outcome of the transition (s, a) but only to the execution of (s, a). This also means that the reward signal is never delayed, so there is no credit assignment problem in exploration learning. The update of the exploration Q-table Q_Exp is thus without a learning rate. I use QLearning as the learning method:

    Q_Exp(s_t, a_t) = R_Exp(s_t, a_t) + γ_Exp max_{a_{t+1}} Q_Exp(s_{t+1}, a_{t+1}).

We can observe that the reward is replaced at each update. Another important observation is the exploration discount factor γ_Exp, which influences how global a scope the exploration strategy has. There are three well-known exploration heuristics, which have been implemented in slightly different versions by several researchers, such as Thrun (1992), Storck et al. (1995) and Wiering (1999), amongst others. I use Wiering's implementations because they are the most intuitive. Wiering himself used the reinforcement learning method Prioritized Sweeping, whereas I have used simple one-step QLearning.

Frequency Based exploration algorithm

The first exploration reward model is based on the frequency of SAP executions. The assumption is that a SAP which has been executed many times probably has a Q-value that approximates the rewards and transition probabilities behind the SAP well. The SAP is therefore given a very negative exploration reward, since we don't gain new knowledge by executing it. The exploration reward function R^E(s, a) is given by

    R^E(s, a) = −Freq(s, a)/c_F,   c_F = scaling constant.   (4.1)

Frequency based exploration is known to be very sensitive to γ_Exp. It is a sample based way of running systematically through the state-action space. Surprisingly, it seems to have fairly low exploration costs, as we will see.

Recency Based exploration algorithm

This model selects the action that has been executed least recently:

    R^E(s, a) = −t/c_R,   t = global task time counter, c_R = scaling constant.

The time variable t is a discrete step counter in the implementation, which is reset only before the first episode. The current implementation would suffer from overflow if the RL agent were to run for a very long time period. If the environment is non-stationary, recency based exploration is an obvious choice - and therefore not a very likely one when dealing with stationary environments, as we do.

Error Based exploration algorithm

This model selects the action where the approximation error δQ is biggest, as we assume that we still have a lot to learn in approximating the environment dynamics:

    R^E(s, a) = δQ/c_E = (Q_{t+1}(s, a) − Q_t(s, a))/c_E,   c_E = scaling constant.

This is like using the Bellman residual for exploration guidance, except that a positive δQ gives a higher reward than a negative one. This strategy therefore prefers to gain knowledge about positive control rewards, which keeps the exploration costs down and the agent away from task-irrelevant (highly negative reward) state space regions. If the environment had some regions where the dynamics and rewards continuously changed at random, an error based agent would keep on exploring at this random spot. However, the RL designer should not define a state such that transition probabilities and rewards change chaotically; even a human would have difficulties learning under such conditions.

Experiments with Directed Exploration

I have made several experiments with the directed exploration methods in order to understand the interplay between the problem structure and the parameter settings of the exploration methods, especially frequency based exploration, which proved most successful. I used a tougher version of the hidden goal task, where the high reward goal is even harder to find.³ The results are shown in figure 4.10. In general, Boltzmann and Max-Boltzmann converged faster than frequency based exploration. Error based exploration displayed convergence problems when the agent needed to go through a lot of pain to reach the high rewards, because of its built-in preference for a positive knowledge gain. This is a problem, since exploration methods need to be able to deal with more advanced reward functions if they are to be suitable for complex problem solving.

The second experiment was a randomly generated maze, solvable using 8-connectivity, with 8.000 different state/action pairs. I averaged performance over 10 runs. This is a much larger problem, and it would reveal how well the methods scale. The results are shown in figure 4.11. Suddenly frequency based exploration showed its worth on large scale problems. In addition, Thrun (1992) showed that directed exploration methods run in polynomial time, compared to undirected methods that use exponential time.

3 The problem can be found in the problem library of the simulator.
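To make the mechanics of the simplest of these heuristics concrete, here is a minimal sketch of frequency based directed exploration as described above: the exploration Q-table is updated without a learning rate, and the heuristic exploration reward penalizes often-executed SAPs. The table names and the constants are assumptions, not the simulator's actual code.

```python
from collections import defaultdict

Q_exp = defaultdict(lambda: defaultdict(float))   # exploration value table
freq = defaultdict(int)                           # SAP execution counter
gamma_exp, c_F = 0.9, 1.0                         # assumed settings

def exploration_update(s, a, s_next, actions_next):
    freq[(s, a)] += 1
    r_exp = -freq[(s, a)] / c_F                   # frequency based heuristic reward
    # the reward is replaced (no learning rate), as in the update rule above
    Q_exp[s][a] = r_exp + gamma_exp * max(Q_exp[s_next][b] for b in actions_next)

def explore_action(s, actions):
    # when exploring, maximise the predicted knowledge gain
    return max(actions, key=lambda a: Q_exp[s][a])
```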

Figure 4.10: Total reward and total CPU time on a tough version of the hidden goal task for directed and undirected agents (frequency based, recency based and error based agents with different scaling constants, plus Boltzmann and Max-Boltzmann agents with different temperatures). The frequency based methods do not seem to have the same high initial exploration costs as Boltzmann, which as always has problems with negative rewards. The frequency based methods converge so suddenly because exploration is stopped at some point to make them converge; if this is done too early, convergence will not yet have occurred. 400 episodes seemed to be the minimum number required.

4.4 Discussions

The overall objective of this chapter is to show the importance of exploration in learning and adaptive control. All exploration strategy tests of the exploration/exploitation trade-off have been carried out with QLearning alone. In addition, some tests have been carried out using only one problem size and configuration. These facts limit the conclusions that can be drawn. However, the objective of this chapter is not to give a solid proof of the optimal exploration parameters and strategies. The scope of the project makes room for comparative tests that review common exploration strategies and confirm already proven results. Maybe some tests have shown interesting tendencies that others could follow up on.

Sutton and Barto wrote in 2000 that they did not know of any careful comparative studies of ε-greedy exploration and soft-max exploration (Boltzmann-Gibbs). The studies I have made in this chapter cannot be characterized as careful, but they give some qualitative hints that could be useful. In general, Boltzmann-Gibbs can be very unpredictable. The dynamic combination of exploration and exploitation, when using a larger Q-value interval than [−1, 1], sometimes fails badly. We could observe this unpredictability in figures 4.8 and 4.9, where the graphs suddenly dropped dramatically. In these situations the Boltzmann distribution can output such uneven probabilities that even 64 bit double values cannot represent such small probabilities.

Figure 4.11: Reward per episode, total reward and CPU time on the large random maze. The hidden goal task has a very small state space; this is a 1.000 state random maze with 8.000 different SAPs. Frequency based exploration suddenly becomes clearly superior to the undirected methods, still showing remarkably cost-effective initial behavior while at the same time converging faster than any other method. This shows that directed exploration indeed is superior when scaling up problems, making directed exploration essential for future AI applications.

The agent can then spend ages changing the probability distribution to a degree that makes it able to escape a suboptimal loop. The ε-greedy strategy, on the other hand, always gives a probability of (1 − ε) + ε/no. of actions to one greedy action, and ε/no. of actions to the rest. It therefore explores and exploits at all times, unless you anneal ε. This makes it a lot more stable and predictable, despite the random based exploration.

Sebastian Thrun showed in 1992 the superiority of directed exploration. My experimental results show that frequency based and error based exploration are better for large scale problems. However, on small scale problems Max-Boltzmann and Boltzmann seem to converge faster, because of their dynamic exploration-exploitation relationship. The optimal choice of exploration method is thus not only a matter of scale, but also depends on the reward function and the problem structure.

Alternative Exploration

Leslie Kaelbling and William Smart (2002) made robot experiments where a human does the action selection and the RL agent is only learning. This alternative exploration strategy can speed up initial learning considerably, if the human knows where to gain valuable task knowledge. At some point, when the human estimates that it has shown the robot what it

knows, the action control is returned to the robot. This is an efficient way to deal with initial convergence problems, which can be considerable for large state spaces with few non-zero rewards. The agent will however always have initial problems when starting in unknown territory - just as a human would. Methods for passing on external knowledge to RL agents prior to acting are therefore essential. Efficient exploration rules should be able to take this knowledge into account.⁴

Kaelbling and Smart's human action selection strategy is only possible because of off-policy methods like QLearning. This interplay between exploration strategy and learning algorithm is also evident in Boltzmann-Gibbs and error based exploration: here, exploration would change if QLearning was replaced with SARSA. The next chapter looks at the difference between off-policy and on-policy reinforcement learning methods.

4 Thrun 1992, p. 23.

Chapter 5

Off-policy methods versus On-policy methods

5.1 Why is this an important issue?

In many situations it is useful to have methods that learn about one behavior while actually following another. This is exactly what off-policy methods do. Off-policy methods like QLearning approximate the optimal policy π* while following π. On-policy methods like SARSA approximate and follow π. What we will see is that the two approaches both have advantages and disadvantages; the two one-step update rules are sketched below. For real world controllers it can have fatal consequences if the Q-values do not give a realistic picture of how the robot acts. If the optimal path through the state space passes by a whole region of fatal states, there is a very realistic chance that an agent in an explorative, initial stage will take an action leading into such a fatal state - and the agent will not see any warning signs in off-policy Q-values when it uses them for control.

5.2 The cliff-walking task

Figure 5.1: This is how a cliff looks in a grid world: a start state and a goal state connected along the edge of a terminal abyss. The red path is the converged QLearning path, the green path the converged SARSA path. SARSA is realistic about its exploration behavior and stays away from the edge.
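Here is a side-by-side sketch of the two one-step updates compared in this chapter (illustrative Python, not the simulator's code). Q is assumed to be a nested dictionary Q[state][action]; the parameter values mirror the standard settings used earlier.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.2, gamma=0.99):
    # off-policy: bootstrap from the greedy action, whatever we actually do next
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.2, gamma=0.99):
    # on-policy: bootstrap from the action the exploring policy really takes
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
```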

This experiment by Sutton and Barto (1998) is called the cliff-walking task, because it simulates a robot walking from a starting point, along a cliff edge, to a goal. The robot must find the shortest path from start to goal, which is along the edge of the cliff. If the robot steps over the edge and into the abyss, the episode is terminated with a negative reward of -1. There are two ways in which the robot can step over the edge: a deliberate explorative action, or - at the time of the action - unknown environment dynamics. The environment dynamics could take the form of wind, slippery rocks or internal mechanical problems in the robot. It is important to understand that off-policy methods do take environment dynamics into account, just not the deliberate suboptimal explorative actions. So when the agent falls over the cliff because of the wind, and actually performs action a' instead of the deliberately chosen a, this will be reflected in the off-policy Q-values, since it is Q(s, a) and not Q(s, a') which is updated.

Figure 5.2: Reward per episode, total reward and total CPU time for QLearning and SARSA on the cliff-walking task. SARSA is realistic and stays away from the edge. That pays off.

The cliff-walking task is solved using ε-greedy exploration with a constant ε = 0.1. As one can observe in figure 5.1, there are 20 discrete steps along the cliff edge, and the optimal path is along the edge. The exploration setting ε = 0.1 means that agents pursuing the optimal path will fall over the cliff edge about every second episode. This can be observed in the task statistics in figure 5.2, where QLearning's reward per episode oscillates between the optimal path and falling over the cliff edge. As you can see, SARSA stays away from the edge, following a suboptimal path, because the exploration mistakes are reflected in its Q-values. I have set the transition probabilities so that the environment is close to deterministic; otherwise the environment dynamics would have meant that the optimal path was not along the cliff edge. Because we keep on exploring with a constant ε, π never converges to π*, and SARSA keeps on walking the suboptimal path, just as QLearning keeps falling over the edge every second time. The total reward graph tells the story. The total CPU time graph shows that QLearning is always taking a shorter path than SARSA.

This scenario is not very realistic, since ε is not annealed to secure convergence. If we use an exploration strategy such as the efficient Max-Boltzmann (ε = 0.6, T = 15), which hardly ever takes an extremely bad action, instead of ε-greedy, the picture changes radically. I let the agents run 400 episodes each and averaged performance by repeating the experiment 5 times - the results are shown in figure 5.3. QLearning converges faster to the optimal policy than SARSA, and it doesn't fall over the cliff edge more often than SARSA, not even in the initial phase. SARSA's cautious attitude should pay off in the beginning, but the dynamics

Figure 5.3: Reward per episode, total reward and total CPU time with Max-Boltzmann exploration. The Boltzmann distribution changes the exploration-exploitation relationship after finding the optimal path; consequently QLearning does not fall over the edge after a while. SARSA stays off the edge for a while longer.

of the Boltzmann distribution save QLearning. Is the cliff-walking task, as it was set up by Sutton and Barto (1998), totally meaningless then - just a matter of exploration and not a question of learning? The cliff-walking task does have an important point; the task is just too simple and the state space too small to prove it. G.A. Rummery (1995) tested off-policy and on-policy methods in his Ph.D. thesis on large, complex, non-grid world problems. On-policy methods proved most successful, because the large initial discrepancy between π* and π made off-policy methods lack control skills.

This being said, off-policy methods still seem to be the best choice in many situations, because a lot of tasks do not have high exploration costs. Therefore you can - as Kaelbling and Smart did - put the agent in all the instructive situations, good as well as bad, without worrying about the Q-values reflecting your initial, suboptimal behavior. Figure 5.3 also shows how SARSA's Q-values remain suboptimal for 35 episodes longer than QLearning's. Furthermore, in some earlier runs the integration of the initial suboptimal explorative behavior in the Q-values meant that SARSA stuck to the suboptimal solution for more than 15 episodes! Convergence in general is simply faster with QLearning.

The issue of off-policy versus on-policy is, as I see it, an issue of prediction versus control. If control and cost minimization are more important than fast convergence, then on-policy is the right choice. If optimal prediction is of highest priority, and convergence time is more costly than exploration costs, then off-policy is the right choice.

Chapter 6

Eligibility traces - Who is to blame?

Up to this point the agent has not used its experience effectively. We have not taken into account that, when learning to solve complex problems, it is often a series of actions that leads to rewardable events. In addition, the reward signal from the environment is sometimes delayed in relation to the action or the series of actions that triggered it. So far we have used QLearning and SARSA, which only update Q(s_t, a_t) when receiving r_{t+1}. But what about Q(s_{t−1}, a_{t−1})? r_{t+1} is also part of its future rewards - and the same can be said about Q(s_{t−2}, a_{t−2}), although the uncertainty about the importance of r_{t+1} as part of the value estimate is now greater. This issue is called the Temporal Credit Assignment Problem, and it is the major theme of this chapter.

There is also another type of credit assignment problem, called structural credit assignment. It deals with exactly which part of the learning system is to blame. In advanced learning systems there can be a series of internal decisions behind one performed action. When the agent receives a reward signal, the right internal decision behind the performed action must be assigned credit. The problem is similar if several actions are performed between each time step. This is the structural credit assignment problem, and it is evident when dealing with very large or even continuous state and action spaces. When using tabular representation, one action per time step, and Q-value based action selection, as the implemented simulator does, structural credit assignment becomes trivial.

6.1 n-step returns and TD(λ) learning

QLearning and SARSA are both based on the one-step temporal difference method called TD(0). Let's review TD(0):

    V^π(s_t) = (1 − α)V^π(s_t) + α(r_{t+1} + γV^π(s_{t+1})).   (6.1)

Alternatively we could wait another time step before we updated V^π(s_t):

    V^π(s_t) = (1 − α)V^π(s_t) + α(r_{t+1} + γr_{t+2} + γ²V^π(s_{t+2})).   (6.2)

Or we could wait n time steps:

    V^π(s_t) = (1 − α)V^π(s_t) + α(Σ_{k=1}^{n} γ^{k−1} r_{t+k} + γ^n V^π(s_{t+n})).   (6.3)

We will call R^{(n)}_t = Σ_{k=1}^{n} γ^{k−1} r_{t+k} + γ^n V^π(s_{t+n}) the n-step return. We can now combine different n-step returns in relation to the problem nature, weighting them so that the weights sum to one (for example three quarters of one n-step return and one quarter of another). Finally, we can combine fractions of all n-step return algorithms in Sutton's (1988) λ-return algorithm:

    R^λ_t(s_t) = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} R^{(n)}_t,   0 ≤ λ ≤ 1.   (6.4)

If we now replace the one-step return in the TD(0) algorithm with the λ-return, we have Sutton's famous TD(λ) algorithm. We are working with episodic tasks, so n ≤ T, where T is the last discrete time step in an episode. At time t, all n-step returns with n ≥ T − t will thus be a Monte Carlo return, not using another state estimate but only experienced rewards. We will call the Monte Carlo return at time t R^{Monte}_t. We now revise the λ-return algorithm for episodic tasks:

    R^λ_t(s_t) = (1 − λ) Σ_{n=1}^{T−(t+1)} λ^{n−1} R^{(n)}_t + λ^{T−t−1} R^{Monte}_t,   0 ≤ λ ≤ 1.   (6.5)

The one-step return has a weight of (1 − λ) and leaves the λ fraction for the rest of the n-step returns to share. Each n-step return takes a (1 − λ) share of what is left and leaves a λ fraction for the (n+1)-step return. The (T − t)th-step return (the Monte Carlo return) takes the whole remaining λ share (λ^{T−t−1}) because it is the last return algorithm; that is why it is outside the summation. The λ-return algorithm is built upon the normalized credit assignment distribution function f(n) = (1 − λ)λ^{n−1}, pictured in figure 6.1.

Figure 6.1: The normalized weight f(n) over discrete time steps for 6 λ values: [0.7, 0.75, 0.8, ..., 0.95]. A large λ value makes the distribution more uniform.

At one end we have λ = 0, which transforms TD(λ) into the one-step TD(0) algorithm.¹ At the other end we have TD(1), which is actually the Monte Carlo learning method.

1 This is why QLearning and SARSA are called TD(0) control algorithms: they are based on TD(λ) with λ = 0.
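As a forward-view illustration, here is a minimal sketch that computes the episodic λ-return of eq. (6.5) from a recorded episode (illustrative Python; the list indexing conventions are my own assumptions, with rewards[i] = r_i, rewards[0] unused, values[i] = V(s_i), and the terminal value values[T] assumed to be 0).

```python
def n_step_return(rewards, values, t, n, gamma):
    G = sum(gamma ** (k - 1) * rewards[t + k] for k in range(1, n + 1))
    return G + gamma ** n * values[t + n]

def lambda_return(rewards, values, t, gamma, lam):
    T = len(values) - 1                                  # terminal time step
    monte = n_step_return(rewards, values, t, T - t, gamma)   # Monte Carlo return
    G = (1 - lam) * sum(lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
                        for n in range(1, T - t))
    return G + lam ** (T - t - 1) * monte
```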

Variable λ

Sutton and Singh (1994) theoretically explored the idea of improving convergence by letting λ vary as a function of state. If a state value estimate V(s_t) was believed to be close to the truth, λ(s_t) was set to zero, thereby giving the rest of the distribution weight to the particular n-step return that uses this near-perfect state estimate. You can observe the distribution effect in figure 6.2. On the other hand, the λ of a state whose estimate seemed very error-prone was set to 1. This has the consequence that the state estimate will not be used to approximate other state estimates.

Figure 6.2: Normalized weights as a function of n and λ. To the left: V(s_{t+2}) is a very good estimate, and the n-step return containing V(s_{t+2}) is therefore assigned the rest of the credit. To the right: a situation where V(s_{t+2}) is a very bad estimate, and the n-step return containing V(s_{t+2}) is therefore assigned no credit, passing its share on to the rest.

The λ-return algorithm for episodic tasks thus changes appearance to

    R^λ_t(s_t) = Σ_{k=1}^{T−(t+1)} R^{(k)}_t (1 − λ_{t+k}) Π_{i=t+1}^{t+k−1} λ_i + R^{Monte}_t Π_{i=t+1}^{T−1} λ_i,   0 ≤ λ ≤ 1.   (6.6)

Variable λ has not yet been used in practical applications.

Revising the credit assignment distribution function in TD(λ)

The question is how realistic the λ credit assignment distribution (CAD) is in relation to approximating the true value of Q* or V^π. The distribution has a built-in assumption about credit assignment which does not necessarily seem to be true for real world problems. With this distribution, action a_t is always credited the most for reward r_{t+1}, independent of the

value of λ. However, for many real world problems we know there is a temporal delay of the reward signal. This means that a_t might not have anything to do with r_{t+1} at all, except that it was the last action executed before receiving r_{t+1}. However, this observation is only problematic if the environment dynamics clearly do not hold the Markov Property. The Markov Property means that delayed rewards in a Markov world are not truly delayed. A delayed reward is rather an event that inevitably will happen once you enter some state, simply because the transition probabilities from there on will lead you to the event. The reward signal is still related to the last state transition into the event state, and in this regard the reward is not delayed. I have graphically illustrated the point in figure 6.3, which is analogous to the bicycle task when an agent passes a point of no return in a fall.

Figure 6.3: A state transition graph in which state s_7 is a terminal state of, for example, the bicycle task, where the agent falls and gets a negative reward. The crucial action is a*, since the agent inevitably ends up in the painful state s_7 (the ground) if a* is executed.

What is the true value of the SAPs of the story? It is obvious that the SAP (s_2, a*) should be assigned credit for the event; Q(s_2, a*) should reflect the inevitable future reward R(∗, s_7, ∗).² However, although the action selections following (s_2, a*) do not influence the course of events, the SAPs following (s_2, a*) definitely have R(∗, s_7, ∗) among their future rewards. That is what counts.

Independent of our knowledge about the environment's inevitability, if we believe in the Markov Property, then the probability of an event being important for the Q*-estimate will always become smaller the further we back up in time from this event. This is the reason why the λ CAD is distributed the way it is. With MDPs, only CADs which have their maximum at t−1 for events at time t make sense. However, the λ CAD still has unnecessary limitations. In TD(λ) we can only adjust the CAD by setting the λ factor. What we will see is that

2 The ∗ is a "don't care".

TD(λ) works best for λ > 0.8. This is because with a large λ you credit actions within such a large time interval after an event that you are sure to credit the responsible actions. The crucial action or actions that really are responsible will then always be within this time interval when the event happens, and will therefore get the most credit in the long run, securing convergence. However, even if we stick to the Markov Property assumption, this distribution function does not seem adequate for large, complex real world problems. It is simply not flexible enough.

6.2 Proposing an Alternative CAD Function

When varying the λ factor in TD(λ), we make the distribution more or less uniform, as shown in figure 6.1. What if the crucial action, putting us on the inevitable path to a negative reward, was four discrete time steps back, but five and six steps back we actually selected the optimal actions? With the λ distribution we cannot evenly credit four steps back without also crediting steps five and six.

Inspired by fuzzy logic membership functions, I have reflected a lot upon distributions and their relationship to the Markov Property. First I constructed an asymmetric Gaussian combination function, but found it to be based on a non-Markov assumption about the environment dynamics. We must remember that with MDPs, only distributions with a maximum at t−1 for events at time t make sense. The maximum can however be stretched back to the crucial action, meaning t−1 does not need to be a unique maximum. With this in mind I implemented a sigmoidal function G_sig, which has exactly the flexibility we are looking for:

    G_sig(t, λ, θ) = 1 / (1 + e^{−λ(t−θ)}),   for λ < 0.

Since this is a probability distribution it must be normalized:

    Ḡ_sig(n, λ, θ) = G_sig(n, λ, θ) / Σ_{k=0}^{∞} G_sig(k, λ, θ),   for λ < 0.   (6.7)

The two parameters λ and θ are problem and state dependent: the problem structure creates the CAD. We could vary the distribution as a function of state, like Sutton and Singh did with the variable λ-return; then λ and θ would be functions. For now we keep them constant. We can now combine all the n-step returns using the new CAD, Ḡ_sig, in a new sigmoidal return algorithm:

    R^{sig}_t(s_t) = Σ_{k=1}^{T−t} Ḡ_sig(k−1) R^{(k)}_t,   (6.8)

where R^{(T−t)}_t is R^{Monte}_t. With the sigmoidal distribution we are able to change not only the variance of the credit assignment distribution, but also to stretch the focus of attention back to the crucial action. The θ parameter controls when the graph drops: a greater θ value moves the crucial action further away from the reward event. The λ parameter sets the interval between the two extrema of the second derivative ∂²G_sig/∂t²: a large λ value makes the graph decline fast.
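Here is a minimal sketch of this credit assignment distribution (illustrative Python). Note that the sign convention, with the minus in the exponent and λ < 0, is my reading of eq. (6.7); the truncation horizon N used to approximate the infinite normalization sum is also an assumption.

```python
import math

def g_sig(t, lam, theta):
    # decreasing sigmoid for lam < 0: close to 1 before theta, drops towards 0 after
    return 1.0 / (1.0 + math.exp(-lam * (t - theta)))

def g_sig_normalised(n, lam, theta, N=1000):
    C_N = sum(g_sig(k, lam, theta) for k in range(N))   # approximates the infinite sum
    return g_sig(n, lam, theta) / C_N

# Example: [g_sig_normalised(n, -0.6, 3) for n in range(10)] gives near-uniform
# credit up to roughly theta steps back from the event, then a fast drop-off.
```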

Figure 6.4: Several variations of Ḡ_sig. Each color is a single θ value with varying λ.

A fall in the bicycle task, as described in figure 6.3, is an example of an event where the CAD in TD(λ) obviously is inefficient or plain wrong. When the agent falls and the bicycle hits the ground at time t, it is inefficient not to stretch maximum credit back to the crucial action a* at time t−4. However, if TD(λ) evens out the distribution enough to credit (s_2, a*), it will also wrongly credit (s_1, π(s_1)), thus slowing down the learning process.³ My sigmoidal distribution, on the other hand, can tackle this situation efficiently by crediting only the eligible SAPs. The flexibility of TD(λ, θ) makes it overlap TD(λ), as you can observe in figure 6.4; TD(−0.6, 3) is for example approximately equivalent to TD(0.95).

The sigmoidal return algorithm is first of all a theoretical construction, because it is based on the ability to look forward into the future, past the current time t, all the way to the end of an episode at time T ≥ t. This gives it the same applicational weaknesses as the λ-return algorithm. I will therefore describe the TD(λ) algorithms based on a backward view of credit assignment, and thereafter show the equivalence between the forward view and the backward view. This will form the basis for creating my own TD learning algorithm based on a backward view implementation of the sigmoidal return algorithm.

6.3 Introducing Eligibility Traces, The Backward View of TD(λ)

When using the λ-return in TD(λ) we had to wait until the end of an episode to update the value estimates.⁴ This makes iterative online approximation of value estimates, where experience is used right away, impossible. The big advantage of TD learning is exactly that it is online,

3 I know from Randløv's dissertation that the time discretization is fairly fine grained, so several actions are performed after the agent has reached the point of no return in a fall.
4 These could be state values V(s) or action values Q(s,a), depending on the control algorithm.

so we have to find an online update algorithm that approximates the λ-return.⁵

We now introduce a new ℝ⁺-valued variable e_t(s), called the eligibility trace of state s. The eligibility trace is updated for all states at each discrete time step according to the following rule:

    e_t(s) = γλ e_{t−1}(s),       if s ≠ s_t;
    e_t(s) = γλ e_{t−1}(s) + 1,   if s = s_t,   (6.9)

where λ is the trace-decay parameter and γ the usual discount factor. When using TD(0) we updated only the previous state value estimate when receiving a reward, but with TD(λ) we update all states with a non-zero trace. In this way we distribute credit to states according to their credit eligibility, which is the current value of e_t(s). We will see that the decay of the trace is equivalent to the λ-return. The basis of the online implementation of TD(λ) is the one-step TD(0) update as we have seen before,

    ΔV_t(s_t) = α(r_{t+1} + γV_t(s_{t+1}) − V_t(s_t)).   (6.10)

The TD(λ) update is similar, just weighted by the trace:

    ΔV_t(s) = α(r_{t+1} + γV_t(s_{t+1}) − V_t(s_t)) e_t(s),   ∀s ∈ S.   (6.11)

As you can observe, if we set λ = 0, e_t(s) will be 1 in the previous state s_t and 0 elsewhere, making eq. (6.10) equal to (6.11). For λ ≠ 0, notice how the one-step TD error propagates through the non-zero trace states. Observing state s_{t−4}, we see that r_{t+1} would be in all n-step returns R^{(n)}_{t−4}(s_{t−4}) for n ≥ 4, which adds up to γ³λ³ r_{t+1} = e_t(s_{t−4}) r_{t+1}. The similarity between the λ-return and the trace based TD(λ) is starting to show.

5 My account of the eligibility traces and their equivalence with the λ-return algorithm is primarily based on Sutton's own description in "Reinforcement Learning - An Introduction".
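Before the formal proof, here is a minimal online sketch of the backward view of eqs. (6.9)-(6.11) (illustrative Python with dictionary tables and accumulating traces; not the simulator's implementation).

```python
def td_lambda_step(V, e, s, r, s_next, alpha, gamma, lam, terminal=False):
    # one-step TD error, eq. (6.10)
    delta = r + (0.0 if terminal else gamma * V[s_next]) - V[s]
    # trace update, eq. (6.9): decay every trace, then accumulate for the visited state
    for x in e:
        e[x] *= gamma * lam
    e[s] = e.get(s, 0.0) + 1.0
    # credit every eligible state, eq. (6.11)
    for x, trace in e.items():
        V[x] += alpha * delta * trace
```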

Proof of Equivalence

We will now show that the total episodic update of a state value estimate V_t is equivalent for the λ-return and the trace based TD(λ). We first notice that if we accumulate all trace updates until time t in state s, we have that

    e_t(s) = Σ_{k=0}^{t} (γλ)^{t−k} I_{ss_k},   (6.12)

where I_{ss_k} is an indicator function equal to 1 if s = s_k and 0 otherwise. We now have to show the following equivalence:

    Σ_{t=0}^{T−1} ΔV^{TD}_t(s) = Σ_{t=0}^{T−1} ΔV^λ_t(s_t) I_{ss_t},   ∀s ∈ S,   (6.13)

where ΔV^{TD}_t(s) is the backward view of TD(λ) using traces, and ΔV^λ_t(s_t) the forward view of TD(λ) using the λ-return algorithm. We start out by rewriting the left-hand side:

    Σ_{t=0}^{T−1} ΔV^{TD}_t(s) = Σ_{t=0}^{T−1} α δ_t e_t(s)   (6.14)
                               = Σ_{t=0}^{T−1} α δ_t Σ_{k=0}^{t} (γλ)^{t−k} I_{ss_k}.   (6.15)

The focus here is the return δ_t at each time step: for each return δ_t, credit is assigned to states based on the accumulated trace e_t(s). We now change the summation limits to get a forward calculation perspective:

    Σ_{t=0}^{T−1} ΔV^{TD}_t(s) = α Σ_{k=0}^{T−1} Σ_{t=0}^{k} (γλ)^{k−t} I_{ss_t} δ_k   (6.16)
                               = α Σ_{t=0}^{T−1} I_{ss_t} Σ_{k=t}^{T−1} (γλ)^{k−t} δ_k.   (6.17)

The focus is now the state s at time t (s = s_t): we take each visit to s_t and sum up the fractions of all the returns δ_k with k ≥ t. The mathematical equivalence of the two perspectives is graphically illustrated in figure 6.5.⁶ To finish the proof we just need to show that the λ-return on the right-hand side of (6.13) can be written in the same form as the forward view version of TD(λ) in (6.17). Notice that for each s, only a single term of the T terms in the following sum has a non-zero value, namely the one for s = s_t. Figure 6.5 illustrates this by showing that the whole λ-return of state s_t is calculated in one time step:

    Σ_{t=0}^{T−1} ΔV^λ_t(s_t) I_{ss_t} = α Σ_{t=0}^{T−1} I_{ss_t} (R^λ_t − V_t(s_t)).   (6.18)

We take the TD error (R^λ_t − V_t(s_t)) for the update of a single state s_t and rewrite it:

    R^λ_t − V_t(s_t) = −V_t(s_t)
        + (1−λ)λ⁰ [r_{t+1} + γV_t(s_{t+1})]
        + (1−λ)λ¹ [r_{t+1} + γr_{t+2} + γ²V_t(s_{t+2})]
        + (1−λ)λ² [r_{t+1} + γr_{t+2} + γ²r_{t+3} + γ³V_t(s_{t+3})]
        ⋮
        + (1−λ)λ^{n−1} [r_{t+1} + γr_{t+2} + ... + γ^{n−1}r_{t+n} + γⁿV_t(s_{t+n})]
        ⋮

We gather all the r_{t+1} terms in the first column, all the r_{t+2} terms in the second column, and so forth:

    R^λ_t − V_t(s_t) = −V_t(s_t)

6 The following mathematical account of the proof is taken straight from Sutton and Barto's Reinforcement Learning - An Introduction.

Figure 6.5: Top panel: return based update of state s, looking back from time t. Bottom panel: visit based update of all states on an episodic path, looking forward from time t and taking the returns δ_k with k ≥ t into account. The top backward view iteratively calculates the λ-return for a state through an episode; the bottom forward view calculates the whole λ-return for state s_t at each time step. Notice the symmetry - the equivalence of the two views is intuitive.

        + (γλ)⁰ [r_{t+1} + γV_t(s_{t+1}) − γλV_t(s_{t+1})]
        + (γλ)¹ [r_{t+2} + γV_t(s_{t+2}) − γλV_t(s_{t+2})]
        + (γλ)² [r_{t+3} + γV_t(s_{t+3}) − γλV_t(s_{t+3})]
        ⋮
        + (γλ)^{T−t−1} [r_T + γV_t(s_T) − γλV_t(s_T)]

      = (γλ)⁰ [r_{t+1} + γV_t(s_{t+1}) − V_t(s_t)]          (= δ_t)
        + (γλ)¹ [r_{t+2} + γV_t(s_{t+2}) − V_t(s_{t+1})]
        + (γλ)² [r_{t+3} + γV_t(s_{t+3}) − V_t(s_{t+2})]
        ⋮
        + (γλ)^{T−t−1} [r_T + γV_t(s_T) − V_t(s_{T−1})]

      = Σ_{k=t}^{T−1} (γλ)^{k−t} δ_k.   (6.19)

If we now replace R^λ_t − V_t(s_t) in (6.18) with the result (6.19), we are done. Because of the order of updates in the online version of TD(λ), it will naturally only be an approximation of the λ-return; the proof is based on offline TD(λ), where all updates are done after an episode.

6.4 TD(λ, θ) - Using Eligibility Traces

The general mathematical idea of switching from the forward view to the backward view is the same, so I will not comment on the following mathematical derivations in the same detail as in the former proof. The structure of the sigmoidal function makes the implementation of the backward view a bit more complicated than with the λ CAD. We will let λ and θ in the sigmoidal function be constant. This makes the normalization sum C_N in Ḡ_sig constant as well:

    C_N = Σ_{t=0}^{∞} G_sig(t, λ, θ).

The normalized sigmoidal CAD function is thus

    Ḡ_sig(t, λ, θ) = 1 / (C_N + C_N e^{−λ(t−θ)}),   for λ < 0.

We observed how Sutton went from one view to the other, so we will find the forward view sum of all updates in an episode using the sigmoidal return instead - and then switch to the backward view the same way Sutton did. We jump right to the point where focus is put on the TD error of a single update, (R^{sig}_t − V_t(s_t)):

    R^{sig}_t − V_t(s_t) = −V_t(s_t)
        + [r_{t+1} + γV_t(s_{t+1})] / (C_N + C_N e^{λθ})

        + [r_{t+1} + γr_{t+2} + γ²V_t(s_{t+2})] / (C_N + C_N e^{−λ+λθ})
        + [r_{t+1} + γr_{t+2} + γ²r_{t+3} + γ³V_t(s_{t+3})] / (C_N + C_N e^{−2λ+λθ})
        ⋮
        + [r_{t+1} + γr_{t+2} + ... + γ^{T−t−1}r_T + γ^{T−t}V_t(s_T)] / (C_N + C_N e^{−(T−t−1)λ+λθ})   (6.20)

We set C₀ = e^{λθ}, and again gather all the r_{t+1} terms in the first column, all the r_{t+2} terms in the second column, and so forth:

    R^{sig}_t − V_t(s_t) =
          γ⁰ [r_{t+1} + γV_t(s_{t+1}) − V_t(s_t)]          (= δ_t)
        + γ¹ (1 − 1/(C_N(1 + C₀))) [r_{t+2} + γV_t(s_{t+2}) − V_t(s_{t+1})]
        + γ² (1 − Σ_{k=0}^{1} 1/(C_N(1 + C₀e^{−λk}))) [r_{t+3} + γV_t(s_{t+3}) − V_t(s_{t+2})]
        + γ³ (1 − Σ_{k=0}^{2} 1/(C_N(1 + C₀e^{−λk}))) [r_{t+4} + γV_t(s_{t+4}) − V_t(s_{t+3})]
        ⋮
        + γ^{T−t−1} (1 − Σ_{k=0}^{T−t−2} 1/(C_N(1 + C₀e^{−λk}))) [r_T + γV_t(s_T) − V_t(s_{T−1})]

      = Σ_{k=t}^{T−1} γ^{k−t} (1 − Σ_{p=0}^{k−t−1} 1/(C_N(1 + C₀e^{−λp}))) δ_k.   (6.21)

That completes the first half! We now have the full episodic update using the forward view, which focuses on state s_t:

    Σ_{t=0}^{T−1} ΔV^{TDsig}_t(s) = α Σ_{t=0}^{T−1} I_{ss_t} Σ_{k=t}^{T−1} γ^{k−t} (1 − Σ_{p=0}^{k−t−1} 1/(C_N(1 + C₀e^{−λp}))) δ_k.   (6.22)

We can easily change to the more realistic backward view, which focuses on the return δ_t instead:

    Σ_{t=0}^{T−1} ΔV^{TDsig}_t(s) = α Σ_{t=0}^{T−1} δ_t Σ_{k=0}^{t} γ^{t−k} (1 − Σ_{p=0}^{t−k−1} 1/(C_N(1 + C₀e^{−λp}))) I_{ss_k}   (6.23)
                                  = α Σ_{t=0}^{T−1} δ_t γ_t(s) e^{sig}_t(s).

To implement the new TD(λ, θ) we need a trace implementation. I had to use three variables: a credit trace e^{sig}_t(s) approximating the credit sum, a discount trace γ_t(s) updating the discount factor separately, and a trace timer φ_t(s). This is because of the structure of eq. (6.23),

which is not as nice and simple as Sutton's λ-return distribution. In the implementation, the trace timer is updated after the trace, using the following rule:

    φ_t(s) = φ_{t−1}(s) + 1,   if s ≠ s_t;
    φ_t(s) = 0,                if s = s_t.

The accumulating trace update is here inappropriate; we instead use the replacing trace rule⁷ for the credit trace update:

    e^{sig}_t(s) = (e^{sig}_{t−1}(s) − Ḡ_sig(φ_{t−1}(s), λ, θ)) f(e^{sig}_{t−1}(s)),   if s ≠ s_t;
    e^{sig}_t(s) = 1,                                                                  if s = s_t,   (6.24)

where f(e^{sig}_{t−1}(s)) is a zero indicator, that is, 0 if the trace is 0 and 1 otherwise. The update of the discount trace γ_t(s) is trivial:

    γ_t(s) = γ_{t−1}(s)·γ,   if s ≠ s_t;
    γ_t(s) = 1,              if s = s_t.   (6.25)

We are now ready to make a control algorithm based on the new trace based learning algorithm TD(λ, θ) - but first we will look at the eligibility trace versions of SARSA and QLearning.

6.5 Trace Based Control Algorithms

6.5.1 The SARSA(λ) algorithm

When we transform the TD(λ) learning algorithm directly into a control algorithm, by substituting V^π(s) with Q^π(s, a), we get the trace based version of the on-policy control algorithm, SARSA(λ):

    Q^π_{t+1}(s, a) = Q^π_t(s, a) + α δ_t e_t(s, a),   for all (s, a),   (6.26)

where δ_t = r_{t+1} + γQ^π_t(s_{t+1}, π(s_{t+1})) − Q^π_t(s_t, a_t), and where we now have a trace variable for each SAP.

6.5.2 The Watkins Q(λ) algorithm

Off-policy algorithms estimate Q* instead of Q^π, and this makes a difference when we want to make a trace based version of QLearning. In off-policy learning we don't want suboptimal exploratory actions to interfere with the Q-values: when the agent takes an exploration action, the consequence should not spill back on former SAPs. Therefore the trace variable is reset if the agent diverts from the greedy policy. The trace does not follow the agent until the end of an episode, but only until the first non-greedy action:

    e_t(s, a) = I_{ss_t} I_{aa_t} + γλ e_{t−1}(s, a),   if Q_{t−1}(s_t, a_t) = max_a Q_{t−1}(s_t, a);
    e_t(s, a) = I_{ss_t} I_{aa_t} + 0,                  otherwise.

Otherwise Q*(λ) uses the trace like SARSA(λ):

    Q*_{t+1}(s, a) = Q*_t(s, a) + α δ_t e_t(s, a),   for all (s, a),   (6.27)

where δ_t = r_{t+1} + γ max_{a_{t+1}} Q*_t(s_{t+1}, a_{t+1}) − Q*_t(s_t, a_t). Actually, Watkins' Q(λ) cuts the trace only after it has used the reward consequence of the first non-greedy action; the same goes for one-step QLearning. The downside of Q(λ) in relation to SARSA(λ) is obviously that it cannot use the trace to the same extent, since the trace is cut now and then. I have used Watkins'⁸ version of Q(λ) instead of Peng's⁹, because it is simpler to implement.

7 The replacing trace will be introduced in the next section.
8 Watkins, C.J.C.H. (1989).
9 Peng and Williams (1994).
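To make the trace mechanics concrete, here is a minimal sketch of one learning step of a Watkins-style Q(λ) (illustrative Python with dictionary tables, not the simulator's code). The cut test follows the trace rule above, i.e. whether the executed action was greedy; SARSA(λ) would be identical except that it bootstraps from the action actually taken next and never cuts the trace.

```python
def watkins_q_lambda_step(Q, e, s, a, r, s_next, alpha, gamma, lam, terminal=False):
    greedy_next = 0.0 if terminal else max(Q[s_next].values())
    delta = r + gamma * greedy_next - Q[s][a]
    e[(s, a)] = e.get((s, a), 0.0) + 1.0          # trace for the executed SAP
    greedy_now = Q[s][a] == max(Q[s].values())    # was the executed action greedy?
    for (x, b), trace in list(e.items()):
        Q[x][b] += alpha * delta * trace          # credit all eligible SAPs
        # decay the trace, or cut it after an exploratory (non-greedy) action
        e[(x, b)] = gamma * lam * trace if greedy_now else 0.0
```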

6.5.3 The SARSA(λ, θ) algorithm and the Q(λ, θ) algorithm

The control version of my own TD(λ, θ) is like the SARSA(λ) implementation in (6.26), but with the three variables e^{sig}_t(s, a), γ_t(s, a) and φ_t(s, a), which are the SAP based versions of (6.24) and (6.25). Q(λ, θ) is like Watkins' Q(λ) with its alternative trace update - below is the credit trace update for Q(λ, θ):

    e^{sig}_t(s, a) = I_{ss_t} I_{aa_t} + γ(e^{sig}_{t−1}(s, a) − Ḡ_sig(φ_{t−1}(s, a), λ, θ)),   if Q_{t−1}(s_t, a_t) = max_a Q_{t−1}(s_t, a);
    e^{sig}_t(s, a) = I_{ss_t} I_{aa_t} + 0,                                                     otherwise.

The challenge of efficient implementation and tests

There are several problems in testing the efficiency of SARSA(λ, θ) and Q(λ, θ). First, the strength of the algorithms will not show as long as the credit distribution is static and not variable as a function of state - the whole point is to be able to fit the distribution so that it accurately reflects the environment dynamics. Secondly, the importance of CAD flexibility grows with the problem scale and the lack of instructive experience. However, the implementation and testing of a variable version of SARSA(λ, θ) and Q(λ, θ) on large, complex problems is a project in itself. In this project, which has a broad focus, I have therefore limited the implementation to the depicted static version and tested it on a simple constructed grid world problem.

The potential and the downside

The new sigmoidal CAD is computationally more expensive than Sutton's λ CAD, because of its mathematical structure. The potential in SARSA(λ, θ) and Q(λ, θ) lies in the possibility of faster convergence per update, because the sigmoidal CAD gives a better Q-estimate. However, with the flexibility of the sigmoidal CAD also comes the ability to make the TD error even bigger.

For a certain class of episodic problems it seems to be possible to implement a semi-variable sigmoidal CAD. In this problem class there are certain terminal states or SAPs that are fatal for the agent. Bad behavior leading to these states can be described within a short

Good behavior leading to high reward terminal states is, on the other hand, made up of a very long series of SAPs. Behavior cannot be labelled as good within a short period of time. I will give an example. An agent is driving on a race track - he is doing really well, but suddenly he takes a bad action (takes a sharp turn driving 3 km/h) and the car flips! The agent was following an optimal policy until the execution of the bad action. Optimal behavior on a race track is defined as finishing a lap in the shortest amount of time while staying on the track. The agent has to execute a very long series of SAPs before we can conclude that it behaved optimally. In this class of problems you will, among others, find J. Randløv's bicycle task, and Martin Jägersand's PUMA robot arm with a Utah/MIT built robot hand that inserts and screws in a light bulb under visual control.

The semi-variable sigmoidal CAD solving such problems could consist of two CADs: a negative reward CAD and a positive reward CAD. The negative reward CAD would have a relatively large negative λ < -1, making it drop fast, and a θ value estimating the "bad behavior" time interval from the crucial action to termination. The CAD should only assign the blame back to the crucial action, and no further. The positive reward CAD would on the other hand have a small λ with 0 > λ > -0.1 and a θ value around zero, making the distribution look more like Sutton's λ distribution. This makes the CAD take the long time interval characteristic of optimal behavior into account, while not forgetting the decreasing probabilities. There are still implementation problems in that the construction of the trace can no longer be iterative. This is due to the fact that we cannot change the CAD from positive to negative before we receive the negative reward. We don't know that it will happen prior to the event. The weight of the trace has to be constructed on the spot based on the negative reward CAD, before we can update the Q-values. We use the trace timer φ_t(s, a) for the construction. When credit for the very negative reward has been distributed, we return to the positive reward CAD.
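To illustrate the intended shape of such a two-regime credit assignment, here is a small Python sketch. The exact G^sig used in this project is defined in an earlier chapter, so the logistic form below is only an assumed stand-in: with a large negative λ and θ set to the estimated bad-behavior interval, the weight stays near 1 for the last θ steps and then cuts off sharply, while a small negative λ with θ around zero gives a slow decay that resembles Sutton's λ distribution.

import math

def cad_weight(phi, lam, theta):
    # Assumed logistic credit weight as a function of the trace timer phi
    # (steps since the SAP was visited); not the thesis's exact G^sig.
    return 1.0 / (1.0 + math.exp(-lam * (phi - theta)))

# Negative reward CAD: blame only the last few SAPs before termination.
print([round(cad_weight(phi, lam=-2.0, theta=4.0), 2) for phi in range(10)])
# Positive reward CAD: slow decay over the long series of SAPs that make up good behavior.
print([round(cad_weight(phi, lam=-0.05, theta=0.0), 2) for phi in range(10)])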

6.6 Experiments

λ experiments

Temporal credit assignment is - as I have shown - difficult to address accurately. When deciding what static size λ to use, the best strategy is trial and error. I have carried out the λ test on an 8-connectivity 15x15 random maze. The performance of each λ value was averaged over 1 runs. The agents used Max-Boltzmann exploration with ɛ = 0.5 and T = 15. Standard settings were used for the rest of the parameters. I used the replacing trace rule for all agents.

Result Analysis

I actually thought that the lambda test was going to be a trivial affair, just showing us that when solving standard grid world problems, it pays off to have a lambda value close to 1.

Figure 6.6 (panels: total reward, total CPU time and total number of steps, all accumulated over episodes): The credit assignment error grows proportionally with the lambda factor.

However, the test results actually show the CAD error that I have tried to correct with my sigmoidal CAD. Notice the graphs for λ = 0.8, λ = 0.9 and λ = 0.95. Convergence is at first fastest with λ = 0.95, simply because the agent knows its way to the goal after a few episodes. However, we can observe that the CAD for λ = 0.95 is off in relation to the real transition probabilities, since convergence after the initial phase is slower than for λ = 0.8. We can observe that λ = 0.8 actually takes the lead around episode 1. If the cost of wrong credit assignment was higher, convergence for large lambda values would be proportionally slower. This shows the potential in variable λ implementations and adaptive sigmoidal CAD implementations. An easy but naive variable λ implementation that would have success with random mazes would be simple λ annealing.

SARSA(λ, θ) and Q(λ, θ): do they work?

Observe figure 6.7 and imagine that you are a robot running out of gas - and you are standing on top of a hill. If you take two steps forward you will slide down a drain without getting hurt and receive gas. If you take a step to the left you will slide down another drain, but there is no gas at the end. If you take one step forward and one step to the right, you will slide down a third drain and crash. The task is episodic, and the transition function and reward function are of course unknown. The problem is deterministic. I constructed this simple test with transition probabilities similar to the bicycle fall, to show the potential in a more flexible CAD. The agent starts out in state s_1. At the end of the three hill falls are terminal states with three different rewards. If you step over the edge you will end up in the terminal state at the bottom, independent of your choice of action.

Figure 6.7: A deterministic delayed reward task. (The grid marks the start state, a sub-optimal goal, solid obstacles, a crash state, and the optimal goal.)

Result Analysis

Yes, they work, as depicted in figure 6.8 - but the question of their actual practical potential will not be answered in this project. There is an important critique of my test experiment: I knew the transition probabilities of the task and could point out the crucial action(s). Therefore it was possible to fit λ and θ exactly to the test. Still, I was also able to fit the λ factor in Q(λ), and the sigmoidal CAD still prevailed because of its ability to drop fast from high to low credit weight. In real world problems the agent does not know the transition probabilities prior to acting. Nonetheless, problems can have CAD characteristics that can be pointed out prior to learning. The challenge is to set the λ and θ parameters on the fly using this knowledge. This would be a way to integrate one's knowledge about the problem into a reinforcement learning algorithm. This test shows that SARSA(λ, θ) and Q(λ, θ) might be useful in an adaptive version.

Figure 6.8 (panels: total reward and CPU time, accumulated over episodes; the agents compared are SARSA(λ, θ) and Q(λ, θ) with different λ and θ settings, SARSA(λ) and Q(λ) with different λ values, and one-step QLearning): Convergence is always fastest for the methods based on the sigmoidal CAD, if the λ factor in the sigmoidal CAD is large enough (greater than -1). If the CAD is stretched beyond the crucial action, the wrong actions will be credited. In addition, if the first high reward terminal state we visit is the crash, the suboptimal solution will be preferred. If the CAD is very short, like one-step QLearning, we can observe that convergence is just very slow.

However, there is a catch to the success of the sigmoidal CAD.

Replacing Traces

If we take a closer look at the accumulating trace update in (6.9), we can observe that the trace can become larger than 1. This is counter-intuitive with regard to the Markov property (recall the discussion of delayed rewards under the Markov Property). Singh and Sutton (1996) developed another trace update rule called the replacing trace rule:

e_t(s, a) = 1,  if s = s_t and a = a_t
e_t(s, a) = γλ e_{t-1}(s, a),  otherwise.

Singh and Sutton (1996) simply reset the trace for the revisited SAP (s_t, a_t) and set it to 1. This has a surprisingly positive effect on convergence. An even more aggressive replacing trace rule is the following:

e_t(s, a) = 1,  if s = s_t and a = a_t
e_t(s, a) = 0,  if s = s_t and a ≠ a_t
e_t(s, a) = γλ e_{t-1}(s, a),  if s ≠ s_t

where all SAPs based on the revisited state s_t are reset.
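The difference between the three trace rules is easiest to see side by side. Below is a small Python sketch of one trace sweep for the tabular case, using the γλ decay of the other trace formulas in this chapter; the function and parameter names are illustrative assumptions, not the simulator's code.

import numpy as np

def update_traces(e, s_t, a_t, gamma, lam, rule="replace"):
    # Decay every trace, then handle the state-action pair visited at time t.
    e *= gamma * lam
    if rule == "accumulate":
        e[s_t, a_t] += 1.0        # accumulating trace: can grow beyond 1 on revisits
    elif rule == "replace":
        e[s_t, a_t] = 1.0         # replacing trace: reset the revisited SAP to 1
    elif rule == "aggressive":
        e[s_t, :] = 0.0           # aggressive replace: zero every action trace of s_t ...
        e[s_t, a_t] = 1.0         # ... and set only the chosen action to 1
    return e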

This also proved to be a correct assessment, as you will see in the following trace experiment.

Figure 6.9 (panels: CPU time per episode, accumulated CPU time, and accumulated number of steps for the accumulating, replacing and aggressive replacing rules): The trace update rule test is based on 1 runs of 2 episodes using a 15x15 random maze (Random15x15 2.mat) with one terminal goal state and the standard reward function.

Result Analysis

I used SARSA(0.9) when comparing the three trace update rules. The two replacing trace rules were stable and superior every time. Singh and Sutton did not comment on the Markov property and its influence on the success of the replacing trace rule. Instead, they gave an example which, if turned upside down, actually speaks against replacing traces (see Sutton & Barto (2000), figure 7.18). The reason for its success is to some extent a proof of the Markov property's success in approximating environment dynamics. Under Markov, a reward r_{t+1} only depends on (s_t, a_t). Crediting other SAPs is only a matter of estimating to what degree the transition probabilities take us to state s_t. So when we re-enter a state s, it is straightforward, according to Markov, that we should reset all traces based on s, since it is only the action we choose now in s that will influence the path following s.

Speeding up computation of trace based methods

The time complexity of trace based learning methods is greater than O(n^3), since the trace update of all SAPs is a triple for-loop. This is obviously not the most efficient update routine, as SAPs with a zero trace are also updated. One way to speed up updates is thus to implement an efficient data structure that holds the dynamic set consisting of all SAPs with non-zero traces.

The dynamic-set operations delete and insert must run in O(1) time. A good choice would thus be lists. In addition to only updating non-zero traces, one should also check the TD(0) error. I implemented a δ check in all trace methods, such that for δ = 0 only traces are updated, not Q-values. This has a considerable impact on initial computational time.

Figure 6.10 (panels: steps per episode, CPU time per episode, and total CPU time): Comparing Watkins Q with and without the δ = 0 check, on a small 10x10 random maze with the binary reward structure.

After the initial phase all δ values are non-zero, and the check is just an extra computational cost, as one can see by comparing the left and the middle graph; however, it is a small price to pay for the initial cut, which will be considerable for large state-spaces.

An alternative way of speeding up TD(λ) is by approximation. Frederick Garcia and Florent Serre (2000) have made an efficient asymptotic approximation of TD(λ) based on the accumulating eligibility trace. They pushed the computational cost of their approximated TD(λ) algorithm, called ATD(λ), down to a TD(0) level by only doing about three updates per step. Garcia and Serre did this by minimizing the norm of the difference between the matrix gain of ATD(λ) and the optimal matrix gain corresponding to TD(0). Garcia and Serre showed that there exists a strong interplay between the optimal value and the choice of the learning rate α, and that the pair (λ = 1, α_n(s) = 1/N_n(s)) defines a new, very efficient temporal difference learning algorithm called ATD. I have not implemented and tested ATD, and at present I have not read about the applicational prospects of ATD. However, for RL users, this shows the importance of correct α settings.
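The two speed-ups described above, updating only SAPs with a non-zero trace and skipping the sweep entirely when δ = 0, can be sketched as follows in Python. A dictionary plays the role of the dynamic set, with O(1) expected insert and delete, in the spirit of the list solution suggested above; all names are illustrative assumptions.

def trace_sweep(Q, traces, delta, alpha, gamma, lam, cutoff=1e-8):
    # traces maps (s, a) -> eligibility and only holds non-zero entries.
    if delta == 0.0:
        return                          # the delta = 0 check: nothing to propagate
    dead = []
    for (s, a), e in traces.items():
        Q[s, a] += alpha * delta * e    # update only SAPs that actually carry a trace
        e *= gamma * lam                # decay the trace
        if e < cutoff:
            dead.append((s, a))         # remember traces that have effectively died out
        else:
            traces[(s, a)] = e
    for key in dead:
        del traces[key]                 # keep the dynamic set small

After the sweep, the trace of the current SAP is set (to 1 with a replacing trace), so that it enters the set for the following steps.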

Chapter 7

Planning - Reflection as a way of learning

Imagine that you are about to do a high jump in an athletics competition. You concentrate. You close your eyes and visualize the whole sequence of actions that you are about to perform. You close your eyes to enable yourself to focus on the mental representation of your experience. You now take the high jump ten times. In your mind. You correct yourself, reflecting in detail on every move and its consequence.

The ability to plan one's actions and reflect upon mental representations of the world has long been seen as an important characteristic of intelligent beings. No known animal has demonstrated an ability to plan its actions to anywhere near the extent that humans do. Reinforcement learning methods based on this ability are called planning methods or model based learning methods, and they have proven to be essential for solving advanced tasks effectively.

When using a model in reinforcement learning, we start to approximate the environment dynamics in a separate data structure called a world model. We can then use the world model to simulate experience that we can learn from. Without a world model, we only learn when directly interacting with the environment. In other words, we only think a single time about experienced events, and that is when we perceive them. Planning methods, on the other hand, can reflect upon interesting experiences several times and learn from this reflection. The learning system is described in figure 7.1. In planning methods, real world experience is used for two things. It helps improve the model, so-called model-learning, and it improves the value function directly using the learning methods depicted in former chapters, which is referred to as direct reinforcement learning. Indirect learning occurs when you use simulated experience to learn.

There are two different kinds of models. The first is the distribution model, which estimates P^a_{ss'} and R^a_{ss'}:

Model_distribution(s, a) = [ P^a_{ss'}, R^a_{ss'} ],  for all s' ∈ S.

Figure 7.1 (diagram with the elements real experience, model, Q-table and policy, connected by direct learning, indirect learning and model learning arrows): The arrow from policy to Q-table is dashed, as it is only true for on-policy methods.

This is the kind of model we need if we want to use dynamic programming as the learning method. The second kind is the sample model, which only stores a single sample of experience for each SAP:

Model_sample(s, a) = (s', r).

This sample is only one possible transition out of the whole distribution. If we want the model to give a realistic picture of the environment dynamics, then sample models can only be used for deterministic environments. However, sample models can still be effective for learning stochastic control problems, since the learning algorithms are suited for approximating stochastic environments using sample experience. Obviously, sample models can bias Q-values towards a single sample if we are not careful about how much we use the model before updating it with new samples. I will introduce two model based learning algorithms, Dyna-Q (Sutton, 1990) and Prioritized Sweeping (Moore and Atkeson, 1993).
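As a minimal illustration of the two model types for the tabular case, here is a Python sketch; the class and attribute names are assumptions of mine, and the distribution model averages the reward per SAP rather than storing R^a_{ss'} per transition, which is enough to show the idea.

from collections import defaultdict

class SampleModel:
    # Stores the single most recent (next state, reward) sample per SAP.
    def __init__(self):
        self.data = {}
    def update(self, s, a, s_next, r):
        self.data[(s, a)] = (s_next, r)
    def simulate(self, s, a):
        return self.data[(s, a)]

class DistributionModel:
    # Estimates P(s'|s,a) from transition counts and the mean reward per SAP.
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.reward_sum = defaultdict(float)
        self.visits = defaultdict(int)
    def update(self, s, a, s_next, r):
        self.counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1
    def estimate(self, s, a):
        n = self.visits[(s, a)]
        probs = {nxt: c / n for nxt, c in self.counts[(s, a)].items()}
        return probs, self.reward_sum[(s, a)] / n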

7.1 The Dyna-Q algorithm

The idea is basically that for each time step we update Q using the just received experience, just like normal one-step QLearning, and in addition we make N updates of Q using simulated experience. The N updates are random updates of previously executed SAPs, where the model has some recorded experience. Below is the Dyna-Q algorithm (from Sutton and Barto (2000), page 164):

Initialize Q(s, a) and Model(s, a) for all s ∈ S and a ∈ A(s)
Do for each time step:
1. s = current nonterminal state
2. a = Policy(s)
3. Execute a; observe the resulting sample: (s, a) → (s', r)
4. Q(s, a) = Q(s, a) + α( r + γ max_{a'} Q(s', a') − Q(s, a) )
5. Model(s, a) = (s', r)   (sample model)
6. Repeat N times:
   s = random previously observed state
   a = random action previously executed in s
   (s', r) = Model(s, a)
   Q(s, a) = Q(s, a) + α( r + γ max_{a'} Q(s', a') − Q(s, a) )

Planning methods simply use their gathered experience to a larger extent. However, Dyna-Q cannot be called "intelligent" reflection, as we just reflect upon random previous experiences. We could, in other words, be spending our time reflecting upon SAPs with a zero TD error. This is inefficient thinking - and consequently slow learning.
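The steps above can be sketched in a few lines of Python for the tabular, sample-model case. The helper names, the dictionary-based tables and the fixed action set are my assumptions and not the simulator's implementation.

import random
from collections import defaultdict

def greedy_value(Q, s, actions):
    # max_a Q(s, a); unseen pairs default to 0 when Q is a defaultdict(float).
    return max(Q[(s, a)] for a in actions)

def dyna_q_step(Q, model, s, a, r, s_next, alpha, gamma, n_planning, actions):
    # Direct reinforcement learning from the real transition (step 4).
    Q[(s, a)] += alpha * (r + gamma * greedy_value(Q, s_next, actions) - Q[(s, a)])
    # Model learning: store the sample (step 5).
    model[(s, a)] = (s_next, r)
    # Indirect learning: N planning updates on random previously observed SAPs (step 6).
    for _ in range(n_planning):
        (ps, pa), (pn, pr) = random.choice(list(model.items()))
        Q[(ps, pa)] += alpha * (pr + gamma * greedy_value(Q, pn, actions) - Q[(ps, pa)])

# Example setup: Q = defaultdict(float); model = {}; actions = range(4) or range(8).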

7.2 The Prioritized Sweeping algorithm

Instead of random reflections upon previous experiences, we want to keep careful track of the most interesting and instructive experiences that we have had. When updating the Q-values, we would like to make the updates that literally make a difference. That is why we should prioritize the updates according to the TD(0) error of the previously experienced SAPs. Large TD(0) errors mean that the Q-values do not reflect an important event in the problem solving process. The algorithm that prioritizes its Q updates in this way is called the Prioritized Sweeping algorithm, by Moore and Atkeson. In Prioritized Sweeping (PS), all experiences that result in a TD(0) error above a defined minimum threshold η are put in a maximum priority queue. We then spend our time updating the Q-value with the biggest TD error. After extracting the maximum prioritized SAP (s, a) from the priority queue and updating Q(s, a), we run through those predecessor states of s that we have previously experienced and check their TD(0) error. We can only reflect upon predecessors that we have experienced; otherwise the model does not hold information about the transition to (s, a). If the TD error of (s, a) is large, then it is likely that some predecessor updates will result in considerable TD(0) errors. Predecessors with a TD(0) error greater than η are inserted in the priority queue. If a predecessor is already in the queue, its priority is reevaluated.

To implement efficient PS is a complex affair. First, one needs to implement a priority queue. I implemented a heap data structure, as the three procedures needed all run in O(lg n) time. Secondly, one needs to manage the experienced predecessor SAPs of each state. This is a dynamic set with no repetitions of SAPs; I thus implemented a linked list. Below is my Prioritized Sweeping algorithm, which is similar to Sutton and Barto's implementation (2000):

Initialize Q(s, a) and Model(s, a) for all s ∈ S and a ∈ A(s)
Do for each time step:
1. s = current nonterminal state
2. a = Policy(s)
3. Execute a; observe the resulting sample: (s, a) → (s', r)
4. Insert (s, a) at the top of the priority queue pQ.
5. Model(s, a) = (s', r)   (sample model)
6. Repeat N times or until pQ is empty:
   (s, a) = Heap_Extract_Max[pQ]
   (s', r) = Model(s, a)
   Q(s, a) = Q(s, a) + α( r + γ max_{a'} Q(s', a') − Q(s, a) )
   For all (s_p, a_p) ∈ Predecessor(s) do:
      r_p = Model(s_p, a_p).reward
      TD_err = r_p + γ max_a Q(s, a) − Q(s_p, a_p),   where TD_err is the TD error of the Q(s_p, a_p) estimate.
      if TD_err > η then Heap_Insert[pQ, (s_p, a_p)], unless (s_p, a_p) is already in pQ, in which case Heap_Increase_Key[pQ, (s_p, a_p), TD_err].

Moore and Atkeson used DP and value iteration for model learning, whereas Sutton and Barto use Q-learning, because their model is sample based. In one respect I have changed Sutton and Barto's implementation back to the way Moore and Atkeson originally implemented it. Sutton and Barto only put (s_t, a_t) in the queue if the TD error is greater than η. I always update Q(s_t, a_t), since (s_t, a_t) is put at the top of the priority queue regardless of the size of its TD error. In this way PS behaves at a minimum like QLearning, and will always converge regardless of the size of η. This also makes PS prioritize current events over old events.

PS related to trace based methods

Let's consider the Q update process of PS in relation to trace based methods. When using traces, one updates the Q-values of SAPs with non-zero eligibility traces. However, if the TD(0) error δ is close to zero, then the computational time could undoubtedly have been spent more efficiently. I implemented the δ = 0 check, which helps in an initial phase; however, PS takes the full step and prioritizes the updates according to the size of δ. This makes PS converge fast, because it makes few but instructive updates. The question is whether the model error is bigger than the CAD error.
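The planning part can be sketched with Python's heapq module, which is a min-heap, so priorities are pushed negated. Predecessor bookkeeping is a dictionary from a state to the set of SAPs observed to lead to it. Unlike the heap procedures used in the thesis implementation, stale or duplicate queue entries are simply tolerated here for brevity, and all names are assumptions of mine.

import heapq

def ps_planning(Q, model, predecessors, pq, alpha, gamma, eta, n_updates, actions):
    for _ in range(n_updates):
        if not pq:
            break                                     # queue empty: stop reflecting early
        _, (s, a) = heapq.heappop(pq)                 # SAP with the largest TD error
        s_next, r = model[(s, a)]
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
        for (ps, pa) in predecessors[s]:              # sweep the experienced predecessors of s
            pn, pr = model[(ps, pa)]                  # pn is s for every recorded predecessor
            td_err = pr + gamma * max(Q[(pn, b)] for b in actions) - Q[(ps, pa)]
            if abs(td_err) > eta:
                heapq.heappush(pq, (-abs(td_err), (ps, pa)))

Each real time step would update the model, add (s, a) to predecessors[s_next], and push (s, a) with a priority below every other entry (for instance -float('inf')), so that it is extracted first, matching point 4 above.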

7.3 Experiments

The size of the threshold η and the number of reflections in PS

The size of η should depend on the reward function. If we for example use the binary reward function with a goal reward equal to 1, then η should be smaller than the learning rate α before any elements will be put in the priority queue. Below is the development of the number of elements in the priority queue for different sizes of η.

Figure 7.2: From the top, the η values are 0.3, 0.2, 0.1, 0.05, and 0. With a small η value the queue is filled up after one goal visit. The performance differences in the first episode are random.

Figure 7.3 (panels: steps per episode, CPU time per episode, and total CPU time): The test was carried out on a small 10x10 maze, with a priority queue size of 5 elements and with 5 reflection updates per step.

As one can observe, the η value should be close to zero. However, the extra computational cost of trying to put every SAP and its experienced predecessors in the queue still matters.

Thus we should save the computational time of even looking at TD errors equal to zero or close to zero. So η should be small, but still fitted according to the reward function. The size of the priority queue depends on the problem scale - you do not want to throw away valuable updates because your queue is too small!

The number of reflections

Assuming that one has a reasonably sized queue and a non-zero η value, setting the number of reflections is to a great extent a matter of how one prioritizes one's CPU time. A robot needs CPU time for IO activities such as perceiving and analyzing data from sensors, outputting control signals to the wheels, communicating with other robots, etc. If one implements the Q-updates in a separate process, a reflection process could be running in the background at all times. The implementational experience with Prioritized Sweeping shows that wrong models do not seem to be a real problem. Convergence simply speeds up with the number of reflections. The catch is prioritizing your CPU time, because the agent can spend a lot of time just standing thinking!

Model Based Methods versus Model Free Methods

Figure 7.4 (panels: steps per episode and CPU time per episode; the agents compared are PS and DynaQ with different numbers of reflections and η values, Watkins Q(0.8) and SARSA(0.8)): The agents only ran for 2 episodes. The model based methods have all converged around episode 15. The test was carried out on a 15x15 maze with 8-connectivity, using ɛ-greedy exploration with ɛ = 0.1.

First, the test merely confirms the results of other researchers such as Atkeson (1997): model based methods are more data efficient than direct reinforcement learning, and thus converge faster. The test also shows that DynaQ is seemingly faster than PS, even though PS's reflections should be qualitatively better than DynaQ's, as they are prioritized.

Figure 7.5 (number of elements in the priority queue per step, for the two PS agents): The PS agents only reflected part of the time. They functioned like QLearning agents the rest of the time.

The answer to why PS converges slower is found when we observe the priority queue during the test (the priority queue development is from the last run, as averaging does not make sense). In figure 7.5 we can observe that most of the time, the priority queue is empty after the reflection updates. This means that PS has in these step intervals reflected 1 times or less, and maybe not at all. DynaQ, on the other hand, always reflects 1 times. PS also runs considerably slower than the other methods. This is because PS makes several more calls to global structures at every step. The MATLAB compiler does not - like a C compiler - optimize the code. It does not transform global variables so that they are not put on a separate page in memory. In C, all globals would be made local and transferred through function calls. In this way they would stay on the inexpensive call stack. This lacking compiler feature makes MATLAB globals extremely expensive. To optimize PS and the rest of the program, all globals should be removed.

Model based methods seem to be the future of reinforcement learning. They are more data efficient than direct RL, they find better policies, and finally the model can be used to efficiently change the Q-values towards new goals. Model-based methods are a must for the robotics community because, in many classical robot tasks, the robots lack experience at all times. With model based learning, one implements a separate process (on a separate CPU if possible) that is dedicated to indirect learning. The process is restarted every time step, after the model has been updated with the new sample experience.
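One way to realize such a separate, dedicated planning process is sketched below with a Python thread; a real robot controller would more likely use a separate process or CPU as described above, and the class, its methods and the simple locking scheme are my own assumptions rather than anything from the simulator.

import threading

class BackgroundPlanner:
    # Keeps running single planning updates until paused for a model/Q update.
    def __init__(self, plan_once):
        self.plan_once = plan_once            # callable performing one simulated update
        self.running = True
        self.lock = threading.Lock()
        self.thread = threading.Thread(target=self._loop, daemon=True)
        self.thread.start()

    def _loop(self):
        while self.running:
            with self.lock:                   # do not plan while the model is being changed
                self.plan_once()

    def real_step(self, apply_experience):
        with self.lock:                       # pause planning, apply the new sample, resume
            apply_experience()

    def stop(self):
        self.running = False
        self.thread.join()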

Chapter 8

Conclusions

In this chapter I will describe my contributions to the different topics within reinforcement learning. I will also describe the limitations of the points that I have made and the algorithms that I have implemented. Finally, I will briefly give my view on the future of reinforcement learning and AI.

8.1 Contributions

The discounting factor of the learning rate

I found an interesting discounting property of the learning rate α after setting γ equal to 1: the learning rate actually discounts future rewards in the initial phase.

A grid world based reinforcement learning simulator

I have implemented a grid world based reinforcement learning simulator, RL Basics for MATLAB. It can simulate several types of control problems. The simulator has built in six exploration strategies and eleven learning algorithms. It is possible to vary all parameters involved in the learning process, from the granularity of general policy iteration to the learning rate. The simulator is a good place to start out experimenting with reinforcement learning.

Limitations

The simulator has no built-in noise, as my project does not deal with perception problems, but only learning. For realistic simulations, noise is alpha and omega. The transition probabilities are as of now also limited in the way described in an earlier chapter.

Challenging exploration strategies with a specially designed test

I discovered the advantages and disadvantages of each exploration strategy using the hidden goal task. The task was able to reveal the importance of balance between exploration and exploitation.

Limitations

The tests should have been carried out much more thoroughly if I were to conclude, with an appropriate amount of scientific accuracy, which strategies prevailed in which situations.

The Weakness of the Cliff Walking Task

I found a weakness in Sutton and Barto's classical cliff walking task, which sheds light on the importance of exploration in learning and the limitations of grid world problems. The cliff walking task is still an easy way to show an important point, but my comment should follow it to underline the clear limitations of the task.

Temporal Credit Assignment

I have discussed the λ-return algorithm and the problems with its built-in credit assignment distribution (CAD). Based on the identified CAD problems, I have developed a new CAD based on a sigmoidal function. I have shown that the new sigmoidal CAD has a flexibility which makes it able to do more accurate trace updates and thereby speed up convergence. I have implemented the sigmoidal CAD in a novel trace based learning algorithm, TD(λ, θ).

Limitations

The new sigmoidal CAD is computationally more expensive than Sutton's λ CAD, because of its mathematical structure. This obviously makes the TD(λ, θ) based control algorithms more costly than their TD(λ) counterparts. In addition, TD(λ, θ) also has the same parameter sensitivity as other direct reinforcement learning methods. It is hard to fit properly. However, in an adaptive version its flexibility could prove beneficial.

Model-based Learning

I have described and implemented two different planning methods: Sutton's Dyna-Q and a sample based version of Moore and Atkeson's Prioritized Sweeping. I analyzed the parameter settings and compared and discussed model based versus model free methods.

8.2 Perspectives on Reinforcement Learning and AI

By now it is widely accepted that learning a task from scratch, i.e. without any prior knowledge, is - to put it mildly - a daunting undertaking. Humans, however, rarely attempt to learn from scratch. They extract initial biases, as well as strategies on how to approach a learning problem, from instructions or demonstrations of other humans.

In addition, humans are equipped with a predefined, highly specialized neural network containing roughly ten billion neurons, where each neuron has between 1,000 and 10,000 connections (synapses) to other neurons. I have been processing data non-stop for almost thirty years using this enormous brain of mine. Taking the power of my "multi-processor" and the time spent into account, I am definitely not overly excited about my performance or my learning speed. In this human perspective, I think AI and reinforcement learning have come a long way. Furthermore, the human perspective also underlines that size matters. To take AI further we need to realize the clear limitations of today's non-parallel binary computers. A promising research project is the DNA computer. This might result in the technological revolution that will change the AI community forever. The DNA computer performs 33 trillion operations per second and produces billions of calculations simultaneously.

Figure 8.1: The highly predefined new-born human touches on a central issue for the future of reinforcement learning, which is a priori knowledge.

We are working with agents having no prior knowledge whatsoever, so the initial convergence problems that reinforcement learning algorithms experience can come as no surprise. To be able to solve complex problems, we need to get better at integrating human knowledge into policies. Kaelbling's human controlled exploration is the right direction. My sigmoidal credit assignment distribution is also a way of integrating prior knowledge to a larger degree.

Advancing reinforcement learning

As of today, reinforcement learning algorithms only deal with first order Markov decision processes. I see this as the most explicit limitation of reinforcement learning. In the light of the DNA computer prospects, one should look into the possibilities of learning on the basis of nth-order Markov decision processes, even though it might be extremely computationally costly.

Appendix A

RL Basics 1.1

A.1 User's manual

To use the RL Basics simulator you must set MATLAB's current path to ..\rlbasics1.1\src. Type pathsetup.

Figure A.1: Here you type the path of the program library RLBasics1.1. Now type RLBasics - this will start the program.

Figure A.2: First you must specify how many agents you want to compare; second, how many episodes each agent gets to optimize itself; and third, how many runs performance should be averaged over.

Figure A.3: Set the problem you want to solve in the Problem settings menu. Here you can also define whether the agent should have 4 or 8 actions in each state. When you turn on the topology dimension, a figure pops up telling you where the agent can't go because of the landscape gradient. You can control this with the size of the max gradient value. One can also create problems in Problem Settings. They will appear in the list when you press make. 6x5 is the smallest possible problem.

Figure A.4: In the menu you will also find the reward function, where you can set the scalar reward for each type of state.

Figure A.5: The transition probabilities can be set in the State transition function, which can be found in the top Problem and Environment menu. You are now ready to define how each agent should learn and explore. Click on the Define agents button.

Figure A.6: In this window you can define each agent. First define the basic learning parameters in Basic RL.

Figure A.7: Second, you must pick a learning algorithm.

Figure A.8: Here the ten different learning algorithms are grouped into one-step, trace and model-based. Finally, you must define what exploration strategy you want for each agent.

Figure A.9: When this is done you press the ready to run test button in the Define Agent Main window, and then Run test in the RL Basics window.

Figure A.10: The test result is presented in six different graphs.


More information

CS 570: Machine Learning Seminar. Fall 2016

CS 570: Machine Learning Seminar. Fall 2016 CS 570: Machine Learning Seminar Fall 2016 Class Information Class web page: http://web.cecs.pdx.edu/~mm/mlseminar2016-2017/fall2016/ Class mailing list: cs570@cs.pdx.edu My office hours: T,Th, 2-3pm or

More information

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018 Machine learning for image classification Lecture 14: Reinforcement learning May 9, 2018 Page 3 Outline Motivation Introduction to reinforcement learning (RL) Value function based methods (Q-learning)

More information

Reinforcement Learning. Summer 2017 Defining MDPs, Planning

Reinforcement Learning. Summer 2017 Defining MDPs, Planning Reinforcement Learning Summer 2017 Defining MDPs, Planning understandability 0 Slide 10 time You are here Markov Process Where you will go depends only on where you are Markov Process: Information state

More information

CO6: Introduction to Computational Neuroscience

CO6: Introduction to Computational Neuroscience CO6: Introduction to Computational Neuroscience Lecturer: J Lussange Ecole Normale Supérieure 29 rue d Ulm e-mail: johann.lussange@ens.fr Solutions to the 2nd exercise sheet If you have any questions regarding

More information

Reinforcement Learning and Control

Reinforcement Learning and Control CS9 Lecture notes Andrew Ng Part XIII Reinforcement Learning and Control We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make

More information

REINFORCEMENT LEARNING

REINFORCEMENT LEARNING REINFORCEMENT LEARNING Larry Page: Where s Google going next? DeepMind's DQN playing Breakout Contents Introduction to Reinforcement Learning Deep Q-Learning INTRODUCTION TO REINFORCEMENT LEARNING Contents

More information

Reinforcement Learning. Introduction

Reinforcement Learning. Introduction Reinforcement Learning Introduction Reinforcement Learning Agent interacts and learns from a stochastic environment Science of sequential decision making Many faces of reinforcement learning Optimal control

More information

CMU Lecture 11: Markov Decision Processes II. Teacher: Gianni A. Di Caro

CMU Lecture 11: Markov Decision Processes II. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 11: Markov Decision Processes II Teacher: Gianni A. Di Caro RECAP: DEFINING MDPS Markov decision processes: o Set of states S o Start state s 0 o Set of actions A o Transitions P(s s,a)

More information

Chapter 6: Temporal Difference Learning

Chapter 6: Temporal Difference Learning Chapter 6: emporal Difference Learning Objectives of this chapter: Introduce emporal Difference (D) learning Focus first on policy evaluation, or prediction, methods Compare efficiency of D learning with

More information

Planning in Markov Decision Processes

Planning in Markov Decision Processes Carnegie Mellon School of Computer Science Deep Reinforcement Learning and Control Planning in Markov Decision Processes Lecture 3, CMU 10703 Katerina Fragkiadaki Markov Decision Process (MDP) A Markov

More information

arxiv: v1 [cs.ai] 1 Jul 2015

arxiv: v1 [cs.ai] 1 Jul 2015 arxiv:507.00353v [cs.ai] Jul 205 Harm van Seijen harm.vanseijen@ualberta.ca A. Rupam Mahmood ashique@ualberta.ca Patrick M. Pilarski patrick.pilarski@ualberta.ca Richard S. Sutton sutton@cs.ualberta.ca

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning RL in continuous MDPs March April, 2015 Large/Continuous MDPs Large/Continuous state space Tabular representation cannot be used Large/Continuous action space Maximization over action

More information

Prioritized Sweeping Converges to the Optimal Value Function

Prioritized Sweeping Converges to the Optimal Value Function Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science

More information

Reinforcement Learning and NLP

Reinforcement Learning and NLP 1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value

More information

Reinforcement Learning (1)

Reinforcement Learning (1) Reinforcement Learning 1 Reinforcement Learning (1) Machine Learning 64-360, Part II Norman Hendrich University of Hamburg, Dept. of Informatics Vogt-Kölln-Str. 30, D-22527 Hamburg hendrich@informatik.uni-hamburg.de

More information

Partially Observable Markov Decision Processes (POMDPs)

Partially Observable Markov Decision Processes (POMDPs) Partially Observable Markov Decision Processes (POMDPs) Geoff Hollinger Sequential Decision Making in Robotics Spring, 2011 *Some media from Reid Simmons, Trey Smith, Tony Cassandra, Michael Littman, and

More information

CS 7180: Behavioral Modeling and Decisionmaking

CS 7180: Behavioral Modeling and Decisionmaking CS 7180: Behavioral Modeling and Decisionmaking in AI Markov Decision Processes for Complex Decisionmaking Prof. Amy Sliva October 17, 2012 Decisions are nondeterministic In many situations, behavior and

More information