Sequential Decision Problems

Michael A. Goodrich
November 10, 2006

If I make changes to these notes after they are posted and if these changes are important (beyond cosmetic), the changes will be highlighted in bold red font. This will allow you to quickly compare this version of the notes to an old version.

1 Introduction

To this point in the class, we have studied stateless games in which the payoff matrix is known. We have explored these games both as stage (single-shot) games and as repeated games. In this portion of the class, we will continue to study repeated games, but we will assume as little about the problems as possible and we will look at a class of algorithms that allow us to represent states. To be precise, we will assume that we don't know the game matrix, we don't know anything about the strategy of the other agent, and we don't know what strategy we should apply to do well in the repeated game. Within these constraints, we will explore a class of reinforcement learning algorithms that attempt to learn various parameters of the repeated game. The purpose of this tutorial is to introduce you to the concepts necessary to understand reinforcement learning algorithms.

2 Sequential Estimation

To begin this tutorial, we will study a technique for sequentially estimating the mean of a random process. To this end, suppose that we have a random variable X. Recall that a random variable is a special type of function (a measurable function, to be precise) that maps the set of theoretically distinguishable events in the world to the set of real numbers. The intuition behind a random variable is that the world can take on a number of distinguishable states that can be described using the axioms of probability. Though distinguishable, not all of these events can be observed directly. Rather, we have to invoke a measurement process that inherits the uncertainty of the state of nature. The random variable is a function that represents the measurement process: it maps the set of distinguishable events to a real number in a lossy manner. If you don't really understand this, don't worry too much. The important thing to keep in mind is that a random variable assigns a real number to some event from the state of nature.

If the mean of this random variable needs to be known, there are several ways to estimate it. Let P_X(x) denote the distribution of this random variable, and let µ denote the mean of this distribution. Suppose that we want to create an estimate, m, of the mean of this distribution given a sequence of observations, x_1, x_2, ..., x_n, drawn according to P_X(x).

We can compute this estimate of the mean from n samples using the following:

    m(n) = (1/n) Σ_{k=1}^{n} x_k.

Unfortunately, with this form, if we want to compute the mean when we have n + 1 samples then we have to redo the sum. It would be better to come up with an iterative method where we express m(n + 1) in terms of m(n):

    m(n + 1) = (1/(n + 1)) Σ_{k=1}^{n+1} x_k
             = (1/(n + 1)) [ x_{n+1} + Σ_{k=1}^{n} x_k ]
             = (1/(n + 1)) x_{n+1} + (n/(n + 1)) m(n).

We can rewrite this iterative method for computing m(n + 1) as a convex sum by setting α_{n+1} = 1/(n + 1) as follows:

    m(n + 1) = α_{n+1} x_{n+1} + (1 − α_{n+1}) m(n).    (1)

This equation is interesting because it means that we can iteratively adjust our estimate of the mean by a weighted combination of the old estimate with a new observation.

Figure 1: Computing the mean using sequential estimation.

The sequential estimate for a random variable X ~ N(µ, σ) with µ = 10 and two values of σ is shown in Figure 1. Notice how it takes longer for the mean to converge for the random variable with the larger standard deviation. However, for both of these random variables, the sample mean will converge to the true mean as n → ∞.
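To see Equation (1) at work, here is a minimal sketch in Python that mimics the experiment behind Figure 1. The particular σ values (1 and 5), the number of samples, and the random seed are illustrative assumptions, not values taken from these notes.

    # A sketch of the sequential estimate in Equation (1).
    import numpy as np

    def sequential_mean(samples):
        """Iteratively estimate the mean: m(n+1) = a*x + (1 - a)*m(n), a = 1/(n+1)."""
        m = 0.0
        estimates = []
        for n, x in enumerate(samples):
            alpha = 1.0 / (n + 1)          # alpha_{n+1} = 1/(n+1) recovers the sample mean
            m = alpha * x + (1.0 - alpha) * m
            estimates.append(m)
        return estimates

    rng = np.random.default_rng(0)
    mu = 10.0
    for sigma in (1.0, 5.0):
        xs = rng.normal(mu, sigma, size=2000)
        m = sequential_mean(xs)
        print(f"sigma={sigma}: estimate after 2000 samples = {m[-1]:.3f}")

Running this for the two σ values shows both estimates settling near 10, with the noisier stream taking longer to settle, just as Figure 1 describes.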

It is interesting to note that this property holds not only for α_n = 1/n, but also for a very large class of α_n's, provided that the values of α_n get small fast, but not too fast. More precisely, given m(n) from Equation (1), the following theorem holds (which we will state, but not prove):

Theorem. If

    Σ_{n=1}^{∞} α_n = ∞    (2)

and

    Σ_{n=1}^{∞} (α_n)^2 ≤ C < ∞    (3)

then lim_{n→∞} m(n) = µ.

This theorem formally defines "fast, but not too fast" as constraints on how the sum of the α_n grows. Equation (3) says that α_n eventually gets small enough that our estimate stops changing, and Equation (2) says that α_n does not get small so quickly that m(n) stops changing before it reaches the true mean. We will use a variation of Equation (1) to estimate the quality of a state-action pair in a sequential decision problem. We now turn attention to how we can define this quality.

3 Quality of a State-Action Pair

To continue with this tutorial, restrict attention to a problem with only a single agent. Consider a world in which there are multiple states, S, but in which the number of states is finite, |S| < ∞. Suppose that from each state we have a finite set of actions that we can take, A, that lead us to a new state, but that the relationship between the current state and the next state is uncertain. We will model this uncertain relationship using a first-order Markov process (please ask about what this is in class if you don't know), governed by the probability mass function p(s′|s, a). This probability mass function represents the likelihood of reaching a new state, s′, from the existing state, s, when we play an action a. We call this probability mass function the transition probability, meaning the probability that we transition from one state to the next when we take a particular action.

Now, suppose that there are good things and bad things that we can do in the world, and that these good things and bad things lead to numerical rewards and penalties. Let these rewards and penalties be denoted by the random variable R(s, a), and suppose that these reinforcers are also random. These reinforcers occur when we are in a particular state s and take an action a within this state. Our goal is to choose a good action no matter what state we find ourselves in. To this end, suppose that someone gives us a function that tells us what action to take in a given state. Denote this function by π : S → A and call it our policy.
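To make these ingredients concrete, here is one way they might be represented in code. The two states, two actions, and the particular probabilities and mean rewards below are invented purely for illustration; they are not part of the lab or the notes.

    # A sketch of the ingredients just described: states S, actions A,
    # transition probabilities p(s'|s, a), mean rewards mu_R(s, a), and a policy pi.
    S = ["s0", "s1"]
    A = ["stay", "go"]

    # p[(s, a)] maps each reachable next state s' to p(s'|s, a).
    p = {
        ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
        ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
        ("s1", "stay"): {"s1": 1.0},
        ("s1", "go"):   {"s0": 0.5, "s1": 0.5},
    }

    # mu_R[(s, a)] is the mean of the reinforcer R(s, a).
    mu_R = {
        ("s0", "stay"): 0.0, ("s0", "go"): -1.0,
        ("s1", "stay"): 1.0, ("s1", "go"): 0.0,
    }

    # A policy maps each state to an action.
    pi = {"s0": "go", "s1": "stay"}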

How well does this particular policy work? To answer this question, we can pretend that we have a function that tells us the expected payoff for using this policy as a function of the state of the world. We will just hypothesize that such a function exists, make some assumptions about how utilities combine, and then we will start to study its properties. Denote this function V(s, π). This value function should accumulate both the immediate reward for making the choice π(s) in the current state, plus all of the future rewards that might accumulate. Since the world is probabilistic, we might spend forever jumping from one state to another. Consequently, we cannot just sum up rewards.

Instead, we will do what we did when we wanted to know how well a particular strategy works in an indefinitely repeated sequence of matrix games: we can discount future rewards. Let N(s) denote all of the states in the neighborhood of state s, meaning all of those states that can be reached with nonzero probability under the transition probability p(s′|s, a) = p(s′|s, π(s)). The expected value of the policy in the given state is then

    V(s, π) = µ_R(s, π(s)) + γ Σ_{s′∈N(s)} p(s′|s, π(s)) [ µ_R(s′, π(s′)) + γ Σ_{s″∈N(s′)} p(s″|s′, π(s′)) [ µ_R(s″, π(s″)) + ... ] ],    (4)

where µ_R(s, π(s)) is the mean of the random variable that describes the reinforcer when action a = π(s) is chosen. This looks intractable because we have to average over the rewards for the next state plus the average rewards reachable from that state, and so on. Fortunately, there is a pattern that we can exploit to write a recursive definition of the function. The key to seeing this pattern is to note that the value V(s, π) is the expected reinforcer plus the discounted average over all expected future rewards achievable from any member of the neighborhood. What is the expected future reward achievable from a member of the neighborhood, s′ ∈ N(s)? It is V(s′, π). This means that Equation (4) can be rewritten as

    V(s, π) = µ_R(s, π(s)) + γ Σ_{s′∈N(s)} p(s′|s, π(s)) V(s′, π).    (5)

Thus, we have a recursive definition of the value of using a policy π : S → A from state s. There are well-known algorithms for taking this recursive definition and solving for V(s, π). One technique involves rewriting Equation (5) as a set of vector equations in the variables V(s, π) for a fixed π and then solving the resulting linear system. This works well unless there are a lot of states; when there are a lot of states, solving the system can take a really long time (unless the neighborhoods of all of the states are small, in which case we can use techniques for solving matrix equations with sparse matrices). A technique that works when there are a lot of states is an iterative algorithm called value iteration. If you are interested in learning about this algorithm, let me know and I can point you to a reference.

Although these algorithms are cool, they do not really help us get to where we are trying to go. Recall that our goal was to find the best policy given the random reinforcer and transition probability for a problem. Equation (5) only tells us the value of a policy in a given state. We could use this idea to search through all possible policies. This is theoretically computable since there are a finite number of states and actions. However, it is not practically computable: a single state has |A| possible action choices, a pair of states has |A|^2, and the full set of |S| states has |A|^|S| possible policies. So, we will have to fiddle with Equation (5) some more.

To this end, pretend that we knew the best possible policy. Denote this best possible policy by π*. If we knew this policy, then we could solve Equation (5) for V(s, π*) for all of the states. We can exploit the existence of this optimal policy to determine the quality of any possible action within a given state. Let Q(s, a) denote the expected utility of choosing action a in state s and then using the optimal policy thereafter. We can modify Equation (5) to help us evaluate Q(s, a) for all state-action pairs (while we still pretend that we know the optimal policy π*). This gives

    Q(s, a) = µ_R(s, a) + γ Σ_{s′∈N(s)} p(s′|s, a) V(s′, π*).    (6)
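For concreteness, here is a sketch of the linear-system approach to Equation (5), followed by Equation (6). It is written against the toy representation from the earlier sketch (the dict-of-dicts p, the mean-reward table mu_R, and the policy pi); the default γ = 0.9 is an arbitrary choice. Strictly speaking, Equation (6) calls for V(·, π*), so this helper matches Equation (6) only when the policy handed to it happens to be optimal.

    # A sketch of policy evaluation: solve Equation (5) as the linear system
    # (I - gamma * P_pi) V = mu_pi, then apply Equation (6) with the resulting V.
    # S, A, p, mu_R, and pi are assumed to have the shapes used in the earlier sketch.
    import numpy as np

    def evaluate_policy(S, A, p, mu_R, pi, gamma=0.9):
        idx = {s: i for i, s in enumerate(S)}
        P_pi = np.zeros((len(S), len(S)))        # P_pi[i, j] = p(s_j | s_i, pi(s_i))
        mu_pi = np.zeros(len(S))                 # mu_pi[i] = mu_R(s_i, pi(s_i))
        for s in S:
            a = pi[s]
            mu_pi[idx[s]] = mu_R[(s, a)]
            for s_next, prob in p[(s, a)].items():
                P_pi[idx[s], idx[s_next]] = prob
        V = np.linalg.solve(np.eye(len(S)) - gamma * P_pi, mu_pi)   # Equation (5)
        Q = {(s, a): mu_R[(s, a)] + gamma * sum(prob * V[idx[s2]]
                                                for s2, prob in p[(s, a)].items())
             for s in S for a in A}              # Equation (6), using this V
        return dict(zip(S, V)), Q

    # Example: V, Q = evaluate_policy(S, A, p, mu_R, pi)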

Now comes a very clever observation. If we knew Q(s, a), then we could find the optimal action in that state as a* = arg max_{a∈A} Q(s, a). Since the optimal policy produces the optimal action in a given state, we know that π*(s) = a* = arg max_{a∈A} Q(s, a), which means that V(s, π*) = max_{a∈A} Q(s, a). Plugging this into Equation (6) yields

    Q(s, a) = µ_R(s, a) + γ Σ_{s′∈N(s)} p(s′|s, a) max_{a′∈A} Q(s′, a′).    (7)

We now have a recursion relation on Q(s, a) which can be solved using the value iteration algorithm. When this algorithm is done running, we have quality estimates (called Q-values) for each state-action pair when we choose optimally for every subsequent choice.

4 Q-Learning

The problem with Equation (7) is that it requires us to know the transition probabilities. In keeping with our goal to use as little knowledge as possible, we should try to find an algorithm that lets us determine the Q-values without knowing the transition probabilities. In this section, we will discuss this algorithm. The algorithm, known as Q-learning, will be the basis for most of the multi-agent reinforcement learning algorithms that we will study this semester.

The Q-learning algorithm uses a sequential estimate similar to the one used in Section 2. The algorithm is as follows:

    Q(s, a) ← (1 − α) Q(s, a) + α [ r(s, a) + γ max_{a′∈A} Q(s′, a′) ].    (8)

Since there is a lot going on in this equation, we will step through each of the parts. To understand the basic structure of the equation, consider a state-free, simplified version of Equation (8):

    Q(a) ← (1 − α) Q(a) + α r.    (9)

Equation (9) does not include any dependence on s, nor does it have the γ max_{a′∈A} Q(s′, a′) portion of Equation (8). In this form, it is easy to see that the Q-learning equation is based on a sequential estimate of an expected value, similar in form to Equation (1) with Q(a) replacing m(n) and r replacing x_n. To make this even more clear, we can replace the notation with index notation and write Equation (9) as

    Q(a; n) = (1 − α_{n−1}) Q(a; n−1) + α_{n−1} r(a; n−1).    (10)

In this form, it is hopefully clear to you that r(a; n−1) is simply the (n−1)st sample from the reinforcement random variable R, and that the nth estimate for the Q-value is just the sequential combination of this sample and the old estimate of the Q-value. Note that the Q-value is a function of the action that we are considering (each action produces a different reward), so we actually have |A| of these sequential estimates occurring, but everything else in Equation (10) is analogous to Equation (1). Thus, the essence of Q-learning is creating a sequential estimate of the quality of performing a particular action. As such, we must select values of α_n that get small fast, but not too fast, which means that they satisfy Equations (2) and (3).
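To see Equation (10) in action, here is a small sketch that maintains one sequential estimate per action. The two actions, their Gaussian reward distributions, and the number of samples are invented for illustration; the point is only that each Q(a) is a running average of the rewards observed for that action.

    # A sketch of Equation (10): one sequential estimate Q(a) per action,
    # each updated with alpha_n = 1/n(a).
    import numpy as np

    rng = np.random.default_rng(1)
    reward_mean = {"left": 2.0, "right": 5.0}   # assumed mu_R(a) for each action
    Q = {a: 0.0 for a in reward_mean}
    counts = {a: 0 for a in reward_mean}

    for _ in range(5000):
        a = rng.choice(list(reward_mean))        # try actions uniformly at random
        r = rng.normal(reward_mean[a], 1.0)      # a sample from R(a)
        counts[a] += 1
        alpha = 1.0 / counts[a]
        Q[a] = (1.0 - alpha) * Q[a] + alpha * r  # Equation (10)

    print(Q)   # should approach {'left': 2.0, 'right': 5.0}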

The next step in understanding the Q-learning equation is to recall that we are estimating the value of a given action in a repeated play context. Thus, we need to take into consideration the expected value of the next action that we will choose. Consider the following equation:

    Q(a; n) = (1 − α_{n−1}) Q(a; n−1) + α_{n−1} [ r(a; n−1) + γ Q(a; n−1) ].    (11)

This is identical to Equation (10) except for the presence of the term γ Q(a; n−1), which is added to the sample from the reinforcer random variable. In effect, this extra term says that we are now including expected future reward, discounted by γ. For this equation, the expected discounted reward for continuing to play action a indefinitely into the future is not known for sure, but we do have an estimate of it in the form of Q(a; n−1). If Equation (11) works correctly, then we would hope to see

    lim_{n→∞} Q(a; n) = µ_R(a) / (1 − γ),

meaning that our estimates of the Q-values converge to the expected discounted reward received for playing action a for all time. Indeed, Equation (11) does cause this limit to hold, but we will postpone giving the reference that proves this claim until we further describe the Q-learning equation.

The final step in interpreting the Q-learning equation is to reintroduce the state variable. Introducing state naively into Equation (11) gives

    Q(s, a; n) = (1 − α_{n−1}) Q(s, a; n−1) + α_{n−1} [ r(s, a; n−1) + γ Q(?, a; n−1) ].    (12)

Hopefully, you can see that introducing state does not change very much. The nth estimate of the Q-value for taking action a in state s is equal to the sequential estimate obtained by combining the old value of our estimate with a sample of the reinforcer received when we take action a in state s, modified by the term that says that we are doing this in a repeated play context. The problem is that when we take an action, we do not know for sure which next state will occur, because the transition from one state to the next given an action is described by a Markov process. Fortunately, we can return to the basics of using samples from a random process to estimate the mean of that process with a sequential estimate. Similar to the way that r(s, a; n−1) is the (n−1)st sample obtained from the random variable R(s, a), we can obtain a sample of the discounted future rewards by simply observing which of all of the possible next states in the neighborhood of s, s′ ∈ N(s), actually occurs. We will call the state that occurs s′. Thus, we can replace the question mark in γ Q(?, a; n−1) by s′, yielding γ Q(s′, a; n−1). Introducing state into Equation (11) thus gives

    Q(s, a; n) = (1 − α_{n−1}) Q(s, a; n−1) + α_{n−1} [ r(s, a; n−1) + γ Q(s′, a; n−1) ].    (13)

This equation says that our estimate of the quality of a state-action pair is obtained as the sequential combination of our old estimate with a new sample made up of our instantaneous reward, r(s, a; n−1), plus the discounted sample of future expected reward, Q(s′, a; n−1).

There is just one problem with this equation, and it stems from the fact that we are assuming that action a is also used in state s′. You can see this by noting that the action a in Q(s′, a; n−1) is the same action as the one used to obtain the reward, r(s, a; n−1). Since Q-values are supposed to be evaluations of the expected reward of choosing a particular action in a given state and then choosing optimally thereafter, this is a problem. Fortunately, we have an estimate of the future expected reward for choosing optimally thereafter in the form of max_{a′∈A} Q(s′, a′; n−1). Notice that the a′ in this expression is a dummy variable and should not be confused with the a in r(s, a; n−1) or Q(s, a; n−1); we could just as well have written max_{b∈A} Q(s′, b; n−1). Thus, Equation (13) should really be

    Q(s, a; n) = (1 − α_{n−1}) Q(s, a; n−1) + α_{n−1} [ r(s, a; n−1) + γ max_{a′∈A} Q(s′, a′; n−1) ].    (14)

When we drop the dependence on n (but assume that it is still there, so that we can make sure that α gets small fast, but not too fast), we obtain Equation (8).

5 Implementing Q-Learning

Watkins [1] proved that using Equation (8) causes our estimates of the Q-values to converge to the true Q-values provided that the following conditions hold:

- The α values get small fast, but not too fast (Equations (2) and (3)).
- We run the program infinitely long (n → ∞).
- We make sure to try each action in each state infinitely often (so that we do not always try the same action every time we reach, for example, some state s).

These conditions seem nice from a theoretical perspective, but they seem to make it impossible to use Q-learning in practice. Fortunately, we can still get very good estimates of the true Q-values using the Q-learning equation, but it takes a little parameter tuning. In the lab, you will be given a set of parameters that work pretty well, and you will then play around with these parameters to see how they affect what is learned. In the remainder of this section, we will discuss some of the things that we can play around with and how they affect the behavior of the algorithm.
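Before looking at the individual knobs, it may help to see Equation (8) inside a complete learning loop. The sketch below assumes a generic episodic environment object with reset() and step(s, a) methods, a fixed exploration probability ε, and a per-state-action count-based step size; none of these choices come from the lab itself, they are just one plausible arrangement.

    # A minimal sketch of tabular Q-learning (Equation (8)). The environment
    # interface (reset/step), epsilon, gamma, and the count-based alpha are
    # illustrative assumptions, not the lab's actual parameters.
    import random
    from collections import defaultdict

    def q_learning(env, actions, episodes=500, gamma=0.9, epsilon=0.1):
        Q = defaultdict(float)          # Q[(s, a)], initialized to 0
        visits = defaultdict(int)       # n(s, a), used to decay alpha
        for _ in range(episodes):
            s = env.reset()             # start state for this episode
            done = False
            while not done:
                # epsilon-greedy action selection (exploration vs. exploitation)
                if random.random() < epsilon:
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda b: Q[(s, b)])
                s_next, r, done = env.step(s, a)   # observe sample of R(s, a) and s'
                visits[(s, a)] += 1
                alpha = 1.0 / visits[(s, a)]       # "small fast, but not too fast"
                target = r + gamma * max(Q[(s_next, b)] for b in actions)
                Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target   # Equation (8)
                s = s_next
        return Q

The subsections that follow discuss the pieces of such a loop that you can tune: where episodes start and stop, how actions are selected, how α decays, and how to decide when learning has converged.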

5.1 Starting and Stopping in a Path-Finding Problem

In the upcoming lab, you will experiment with using Q-learning to solve a path-finding problem. The task will be to learn a table of Q-values (a table, since you need to find values for all possible states and all possible actions from within those states). One way that you will measure the performance of your values will be by computing the average path length from a given starting point to the goal location. Penalties (r(s, a) < 0) will be assessed when you run into walls, but a payoff (r(s, a) > 0) will only be delivered when you reach the goal.

The first question to answer is how to compute the Q-values for the goal state. Let s_G denote the goal state, so the question that we are trying to answer is how we determine Q(s_G, a) for all values of a. The easiest way to do this is to set rewards to zero for all values of s and a, except for when s = s_G. When s = s_G, set the rewards to r(s_G, a) = c for some constant c for all actions. Under these conditions, the Q-values for this state will eventually become c/(1 − γ) for all actions. Since this can be computed offline, you can just set this before the algorithm starts to run. Then, when the goal state is reached during a learning episode, you can conclude the episode and begin a new episode at some starting location.

This leads to the question of how to choose states from which to learn. One way to learn is to always begin in the starting location and then explore (using techniques outlined in the next section) until the goal is reached. The problem with this approach is that, because transitions from one state to another are only probabilistically determined by the chosen action, learning only from the given starting state will cause the state space to be unevenly sampled; states that are near the direct path will be explored a lot because they will be reached with high probability, and states that are far from the direct path will rarely be observed. Although this is not a problem from a theoretical perspective (because the algorithm will run an infinite number of times), it is a problem when we run the algorithm a limited number of times. One way to sample the state space more uniformly is to start at a random location in the state space, learn using one of the exploration methods described below, and then conclude learning when the goal state is reached. Randomly restarting will tend to produce more uniform coverage of the state space and will make learning approach the true values more quickly, which means that we have a hope that the Q-values will be useful even if we do not let the algorithm run forever.

5.2 Exploration versus Exploitation

One of the easiest ways to ensure that every action is taken from every state infinitely often is to choose randomly from the set of possible actions available to the agent in a given state. This ensures that there is enough coverage of the possible actions to lead the Q-learning algorithm to convergence. In a path-finding problem, we could theoretically randomly choose a state, randomly choose an action from that state, observe the reinforcement and state that result from choosing that action in that state, update the Q-values, and then repeat the entire process. This process never actually finds a path from a starting state to the goal, but if we wait long enough and have our α values decay properly, then the estimated Q-values will eventually approach the true Q-values.

When we shift back to reality, we cannot wait forever for convergence. Instead, we want to try to determine whether the estimated Q-values are getting close to the true Q-values, or at least close enough that we are solving the problem we set out to solve. In a path-finding problem, we do this by (a) checking to see if the path is getting more efficient and (b) checking to see if we are having wild swings in the Q-values as things progress. To do this, we need to try to choose actions that bias the learning toward those actions that are likely to be closer to the optimal actions. Although we need to try to do this, we do not want to go so far that we forget to explore around a bit. In other words, we want to find a balance between exploiting what we have learned (and thus biasing our learning toward actions that are more likely to be successful) and exploring (and thus ensuring that sufficient exploration takes place to prevent premature convergence). In the lab, you will experiment with several different approaches to finding this tradeoff.
At one extreme, you will always exploit what has been learned and depend entirely on the randomness in the world (and possibly random restarts) to explore. At the other extreme, you will randomly select actions and rely on the randomness in the world to lead you to the goal.
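Between these two extremes, a common compromise is to exploit most of the time but explore with some probability that shrinks as learning proceeds. The sketch below shows one such ε-greedy rule with a per-state decaying ε; the schedule ε(s) = 1/(1 + visits(s)) is an arbitrary illustrative choice, not something prescribed by the lab.

    # A sketch of one way to trade off exploration and exploitation:
    # epsilon-greedy selection with an epsilon that decays per state.
    import random

    def choose_action(Q, s, actions, visits):
        epsilon = 1.0 / (1.0 + visits[s])            # explore less as s is visited more
        if random.random() < epsilon:
            return random.choice(actions)             # explore: uniform random action
        return max(actions, key=lambda a: Q[(s, a)])  # exploit: current best estimate

Another common choice is softmax (Boltzmann) selection, in which actions are sampled with probability proportional to exp(Q(s, a)/τ) for a temperature τ that is gradually lowered, so that exploration fades smoothly into exploitation.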

5.3 Decaying α

In theory, any α schedule that decays fast, but not too fast, will lead to convergence. In practice, we have to be more selective. For example, we have already seen that α_n = 1/n satisfies the conditions of Equations (2) and (3), but this is not a very good idea in practice because the values of α_n get small too early in the learning, which means that changes in Q-values in one region of the state space propagate to other regions of the state space very slowly. In the lab, we take the max of two different α-decay functions: one that stays high for a long time and then decays to zero exponentially fast, and another that decays linearly with n. Another approach is to use a different α for each state, α(s; n), and then decay this state-dependent α only when the state is visited.

5.4 Convergence Issues

Since we cannot explore forever, we want to decide when to quit. The approach suggested in the lab is to measure how much the Q-values change as learning progresses. When this change gets really small, we can safely guess that most of the learning has stopped. At the very least, we need to evaluate how much all of the Q-values have changed, so we test whether

    Σ_{s∈S} Σ_{a∈A} |Q(s, a; n) − Q(s, a; n−1)| < ɛ.

When you do this, watch out for two commonly made mistakes (based on the 2004 lab write-ups). First, since the α values decay, it is possible to find that the Q-values have not changed very much simply because the value of α is so small. Figure out a way to test for convergence that avoids this problem. Second, you probably do not want to test convergence using a random restart that produces a very short path (e.g., starting right next to the goal and exploiting current Q-value estimates); such an approach will almost always produce small changes from one iteration to the next simply because so few Q-values could possibly be changed. Figure out a way to test for convergence that avoids this problem too.

References

[1] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279-292, 1992.