State Space Reduction for Hierarchical Policy Formation


Department of Computer Science and Engineering
University of Texas at Arlington
Arlington, TX

State Space Reduction for Hierarchical Policy Formation

Mehran Asadi

Technical Report CSE

This report was also submitted as an M.S. thesis.

STATE SPACE REDUCTION FOR HIERARCHICAL POLICY FORMATION

The members of the Committee approve the master's thesis of Mehran Asadi:

Dr. Manfred Huber, Supervising Professor
Dr. Diane J. Cook
Dr. Lawrence B. Holder

Copyright by Mehran Asadi 2003
All Rights Reserved

STATE SPACE REDUCTION FOR HIERARCHICAL POLICY FORMATION

by

MEHRAN ASADI

Presented to the Faculty of the Graduate School of The University of Texas at Arlington in Partial Fulfillment of the Requirements for the Degree of

MASTER OF SCIENCE IN COMPUTER SCIENCE AND ENGINEERING

THE UNIVERSITY OF TEXAS AT ARLINGTON

December 2003

ACKNOWLEDGEMENTS

I would like to thank my research advisor, Dr. Huber, whose ideas were the origin of this research and whose advice has always been invaluable. He was the first person who truly taught me the basic concepts of A.I. and showed me the path of doing research in this field. He spent late nights writing papers with me, and all his suggestions made me think about our research more deeply. I also thank my committee for their careful judgment and their suggestions to improve this research. I must also thank my wife for all her support, without whom obtaining this degree would have been absolutely impossible. I do not know how I can express my appreciation for being a part of her life. Next, I must thank my mother and my mother-in-law; their support helped me to continue my education and they encouraged me to stay focused at every single step of my studies. Finally, I would like to thank my brother Mehrdad, my sister-in-law Elham and their daughter Donya, whose presence always makes me happy.

August 17, 2003

ABSTRACT

STATE SPACE REDUCTION FOR HIERARCHICAL POLICY FORMATION

Publication No.

Mehran Asadi, M.S.

The University of Texas at Arlington, 2003

Supervising Professor: Manfred Huber

This thesis provides new techniques for abstracting the state space of a Markov Decision Process (MDP). These techniques extend one of the recent minimization models, which is known as ε-reduction, to construct a partition space that has a smaller number of states than the original MDP.

As a result, learning policies on the partition space should be faster than on the original state space. The technique presented here is to execute a policy instead of a single action, and to group all states which have a small difference in transition probabilities and reward function under a given policy. This turns the original MDP into an SMDP-like model by expanding the actions of the original MDP into multi-step actions. When the reward structure is not known, a reward independent method is introduced for state aggregation. The reward independent method for state reduction is applied when reward information is not available, and a theorem in this thesis proves the solvability of this type of partition. Simulations on different state spaces show that the policies in both the MDP and this representation are very close, and the total learning time in the partition space in our approach is much smaller than the total amount of time spent on learning in the original state space.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
LIST OF ILLUSTRATIONS

Chapter
1. INTRODUCTION
   1.1 Decision making under uncertainty
   1.2 Models of planning
   1.3 Temporal abstraction
   1.4 Contribution of this thesis
       1.4.1 Temporal abstraction
       1.4.2 Action dependent decomposition
       1.4.3 Reward independent decomposition
2. FORMALISM
   2.1 Markov Decision Processes
       2.1.1 The Value function
       2.1.2 Computing the optimal policy
       2.1.3 Value iteration
       2.1.4 Policy iteration
   2.2 State Abstraction
   2.3 Introduction to SMDPs
       2.3.1 Policies
       2.3.2 Example of SMDPs
3. PREVIOUS WORK
   3.1 SMDPs
       3.1.1 Policies
       3.1.2 Example of SMDPs
   3.2 State abstraction in MDPs
   3.3 Bounded Parameter MDPs
   3.4 Epsilon Reduction Methods
4. EXTENSION TO EPSILON REDUCTION METHOD
   4.1 Epsilon-delta extension to SMDPs
   4.2 Action Dependent Decomposition
   4.3 A Simple Example
   4.4 Reward Independent Partition
   4.5 Experimental Results
   4.6 Conclusion and Future work

References
Biographical information

LIST OF ILLUSTRATIONS

Figure
1.1 A sample robot navigation environment
2.1 A graph view of a MDP
2.2 Policy Iteration
3.1 Comparison between MDP and SMDP
3.2 The rooms example as a grid world environment
3.3 The policy of one of the rooms
3.4 The state transition diagram for a BPMDP
4.1 Grid world of the example
4.2 Partition for option left
4.3 Partition for option up
4.4 Partition for option right
4.5 Partition for option down
4.6 Intersection of blocks of partition
4.7 Final blocks of partition
4.8 The pattern for the experiments
4.9 The final blocks of partition for experiment 1
4.10 Dean's method run time
4.11 The reward independent method running time
4.12 Number of iterations for learning a policy
4.13 Comparison of number of blocks
4.14 Environment for comparing run time
4.15 Running time for Dean's method
4.16 Running time of reward independent method
4.17 Number of iterations for learning a policy

CHAPTER I

INTRODUCTION

Markov decision processes (MDPs) are a useful way to model stochastic environments, as there are well-established algorithms to solve these models. Even though these algorithms find an optimal solution for the model, they suffer from high time complexity when the number of decision points is large. To address increasingly complex problems, it is also necessary to find representations that are sufficient to address the task while remaining sufficiently compact to permit learning in an efficient manner. The emphasis here is put on the state space representation used in the decision-making process rather than on the one used for sensing and memory purposes. The idea is that a reduced representation for decision making, combined with the use of increasingly competent actions in the form of policies, can dramatically reduce the number of decision points and can lead to a much more efficient transfer of learning experiences across situations and tasks. A number of learning approaches have used specially designed state space representations to increase the

efficiency of learning [4,8]. Here, particular features are hand-designed based on the task domain and the capabilities of the learning agent. In autonomous systems, however, this is generally a difficult task since it is hard to anticipate which parts of the underlying physical state are important for the given decision-making problem. Moreover, in hierarchical learning approaches the required information might change over time as increasingly competent actions become available. The same can be observed in biological systems, where information about all muscle fibers is initially instrumental in generating strategies for coordinated movement. However, as such strategies become established and ready to be used, this low-level information no longer has to be consciously taken into account when learning policies for new tasks. To achieve similar capabilities in artificial agents, state and knowledge representations should depend on the action set that is currently available, and become increasingly abstract as more higher-level policies become available as actions and fewer of the low-level action primitives are required. A small number of techniques for generating more compact state representations based on the actions and the reward function have been developed [4,8]. The work presented here builds on the ε reduction technique developed by Dean et al. [2] to derive representations in the form of state space

partitions that ensure that the utility of a policy learned in the reduced state space is within a fixed bound of the optimal policy. The work presented here extends the ε reduction technique by including policies as actions and thus using it to find approximate SMDP reductions. Furthermore, it derives partitions for individual actions and composes them into representations for any given subset of the action space. This is further extended by permitting the definition of reward independent partitions that can be refined once the reward function is known. The remainder of this chapter provides an overview of previous work in hierarchical learning and stochastic processes and the contribution of the thesis.

1.1 Decision making under uncertainty

One of the basic concepts in stochastic processes is a control process in an environment in which there is uncertainty. Solving a control process [13] is considered from the perspective of an agent that acts in the environment. An agent can be a robot which navigates a house, a human executing a strategy, or a program which controls traffic lights. The goal of decision-making is to find a plan or a policy that maximizes the total benefit of acting in an environment over a period of time. Decision-making has

broad application in operations research, artificial intelligence, control theory, management and scheduling [19]. Uncertainty exists in almost all situations in real life. This issue plays an important part in scientific problems and engineering models.

Figure 1.1: A sample robot navigation environment

An example in artificial intelligence is a robot that moves through a grid world, as in the mouse-and-maze problem. The robot has the ability to perform actions such as moving forward and turning by an angle, and the maze is an environment with different states as in Figure 1.1. The purpose of this representation is to provide the

information necessary to construct a navigation strategy for a given goal location. Uncertainty is present at all times in this environment. For example, the robot's motors may not function as expected, moving the robot in a wrong direction or moving it too far. Furthermore, the sensors of the robot can be unreliable and provide incorrect readings from the environment. A number of systems have the Markov property and can be modeled as a MDP. While well-known algorithms exist to solve a MDP and find an optimal solution (policy), the state representation of these problems is often so large that these algorithms require a large amount of memory and time.

1.2 Models of planning

The relationship between the time spent on planning and the time spent on executing a plan is a way to distinguish different planning models from each other. Usually, finding an optimal plan is time consuming. For this reason some planning methods are constructed in off-line mode, which permits them to be performed on powerful computers. Once a plan is constructed, it can be loaded onto a smaller computer to be executed on-line. The smaller

computer can be a robot with less memory and a slower processor. Off-line planning often assumes that complete knowledge of the environment is available, and it considers all outcomes, even those that have a very small chance of occurring. Thus, if the number of states is large this process has a high time complexity. Unlike off-line planning, on-line planning does not assume complete knowledge of the environment, and the agent tries to construct and refine a plan while acting in the world. In the extreme case, the agent starts to act with no initial plan and no model of the environment. This is particularly useful when the state space is large, as the agent often needs only the information for its next act and does not require complete knowledge of the environment. Reinforcement learning [6] is an example of such methods.

1.3 Temporal and state abstraction

Semi-Markov decision processes [21] were originally constructed in order to address the representation of a hierarchical action space in a MDP by considering the execution of sequences of actions, i.e. policies. This approach derives optimal solutions in the presence of more complex actions.

State abstraction is a group of methods where a single state represents a large group of states. State abstraction often involves a tradeoff between optimality and compactness, and one of the questions that needs to be answered in abstracting the state space is the relationship between a solution on the abstract model and a solution on the original model [1,2]. Another problem in state space reduction is that in a real situation the agent often does not know the task before executing an action.

1.4 Contribution of this thesis

This thesis introduces a new approach for state space reduction. In particular, it uses the techniques described in the following subsections to extend ε reduction [3], one of the recent methods in state space reduction.

1.4.1 Temporal abstraction

Actions can be considered to be primitive or high-level. A primitive (low-level) action takes a constant amount of time, but a high-level action takes varying amounts of time. Temporal abstraction refers to the use of high-level actions such as opening a door, which consists of several primitive actions like unlock, move and release.

This thesis uses the ability to apply multi-step actions instead of primitive actions in the ε reduction method to reduce the size of the state space.

1.4.2 Action dependent decomposition

One aspect of hierarchical learning is to construct a tree-type structure on the state space, in which actions in higher-level sets can be considered as policies in lower-level sets. A second aspect of hierarchical learning approaches is that as new and more complex actions become available, low-level actions are no longer required to learn a task and thus can be ignored. The intuition here is that such a policy will involve fewer decision points and, as a result, can be learned substantially faster. To take full advantage of this limitation of the action space, it should also be reflected in the state representation. In particular, once low-level actions are ignored, much more abstract state representations should be sufficient to address the same tasks.

1.4.3 Reward independent decomposition

This thesis addresses the state decomposition problem in two stages. It first assumes that the agent has no

knowledge of the rewards associated with the states, and tries to find a policy without having the reward function in hand. In this method, optimality is not guaranteed, but solving the reduced MDP is much faster than solving the original one. In many real-world situations the reward information is not known beforehand. In this case the state reduction cannot be performed. This thesis provides a new method for state reduction when the reward is not available and proves that the reduced space is solvable for a certain type of task. Moreover, it provides the possibility of reward-specific refinement to optimize the resulting representation once the reward information is available.

CHAPTER II

FORMALISM

In the MDP framework, a number of algorithms exist that provably converge to an optimal policy in both off-line and on-line planning. In the off-line case, these include value iteration and policy iteration. This chapter introduces the Markov decision process model. In this model the environment is divided into different states. A set of actions relates these states and makes the model a stochastic process. To each state-action pair, a transition probability and a reward value are assigned. The goal is to find an optimal policy, i.e. a policy that maximizes the utility of interacting with the environment. When almost-optimal policies are acceptable, there are algorithms that converge to a policy with a bounded distance from the optimal policy. One way to find this type of policy is to find all the states that have almost the same properties and group them as a new state. This method of state reduction is often called state aggregation.

State aggregation methods reduce the size of the space and make the decision problem easier to solve. One technique for state aggregation is ε-reduction, which will be discussed in later chapters and forms the basis for the work presented here. The ε-reduction algorithm constructs a valid MDP model, which converges to a policy.

2.1 Markov Decision Processes

A Markov decision process (MDP) is a 4-tuple (S, A, P, R) where S is the set of states, A is the set of actions available in each state, P is a transition probability function that assigns a value 0 ≤ p ≤ 1 to each state-action pair, and R is the reward function. The transition function is a map P : S × A × S → [0,1], usually denoted by P(s'|s, a), which is the probability that executing action a in state s will lead to state s'. Similarly, the reward function is a map R : S × A → ℝ, and R(s, a) denotes the reward gained by executing action a in state s. Figure 2.1 illustrates a MDP with 10 states and 2 different actions.
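To make the notation concrete, the following is a tiny, purely illustrative encoding of such a 4-tuple in Python; the two-state example values and the dictionary layout are assumptions made for this sketch and are not taken from the thesis.

```python
# A minimal, hypothetical encoding of an MDP (S, A, P, R).
# P[(s, a)] maps successor states s' to probabilities P(s'|s, a);
# R[(s, a)] is the immediate reward for executing a in s.
states = ["s0", "s1"]
actions = ["a0", "a1"]
P = {
    ("s0", "a0"): {"s0": 0.2, "s1": 0.8},
    ("s0", "a1"): {"s0": 1.0},
    ("s1", "a0"): {"s1": 1.0},
    ("s1", "a1"): {"s0": 0.5, "s1": 0.5},
}
R = {("s0", "a0"): 0.0, ("s0", "a1"): 0.0,
     ("s1", "a0"): 1.0, ("s1", "a1"): 0.0}
```

The later sketches in this chapter reuse this dictionary layout.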

Figure 2.1: A graphical view of a MDP

A policy is a map π : S → A, which assigns an action to each state. A stochastic process is said to satisfy the Markov property if for all t_0 < t_1 < ... < t_n < t and for all n it is true that
$$P(s_{n+1} \mid s_0, a_0, s_1, a_1, \ldots, s_n, a_n) = P(s_{n+1} \mid s_n, a_n).$$
The usual way to measure the cost or utility in an environment is to compute a value function. A value function is a mapping V : S → ℝ that assigns a utility to each state. The aim is to find an optimal policy, i.e. a policy that maximizes the value of interacting with the

environment. Throughout this thesis the reward criterion is used as the optimality criterion. The reward criterion is the expected discounted sum of rewards received by the agent under a policy and can be defined as:
$$\sum_{t=0}^{N} \gamma^{t} R(s_t, \pi(s_t))$$
where s_t indicates the state at time t, and γ ∈ [0,1] is a discount factor that indicates how time affects the utility measure.

2.1.1 The Value Function

The value function provides a utility measure for each state. These values can be computed in order to find an optimal policy. Any policy π defines a value function
$$V^{\pi}(s) = R(s, \pi(s)) + \sum_{t=1}^{\infty} \gamma^{t} R(s_t, \pi(s_t)) \qquad (2.1)$$
The Bellman equation creates a connection between the value of each state and the values of other states:
$$V^{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, V^{\pi}(s') \qquad (2.2)$$
Equation 2.2 shows the relation between the value function and the immediate reward received in the succeeding state [1].

Since the optimal policy assigns the best action to each state, the value function for the optimal policy can be defined as:
$$V^{*}(s) = \max_{a} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \right] \qquad (2.3)$$

2.1.2 Computing the Optimal Policy

Two common methods for finding an optimal policy are value iteration and policy iteration. This section provides an algorithm for each of these methods.

2.1.3 Value Iteration

This method starts with an arbitrary value function, such as V_0(s) = R(s, a) for some a ∈ A, and uses it to find the next value function using the equation:
$$V_{i+1}(s) = \max_{a} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V_i(s') \right]. \qquad (2.4)$$
The optimal value function V* is found when V_{t+1}(s) = V_t(s) for all s ∈ S. The corresponding optimal policy is:
$$\pi^{*}(s) = \arg\max_{a} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \right]. \qquad (2.5)$$
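The update in equation (2.4) and the greedy extraction in equation (2.5) can be written compactly in code. The following is a minimal sketch, assuming the dictionary-based MDP layout of the earlier illustration (P[(s, a)] as a dict of successor probabilities, R[(s, a)] as a scalar); it is an illustration, not the thesis' implementation.

```python
# Hypothetical value-iteration sketch for equations (2.4) and (2.5).
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    # V_0(s) = R(s, a) for some fixed action a, as in the text.
    V = {s: R[(s, actions[0])] for s in states}
    while True:
        V_new = {}
        for s in states:
            V_new[s] = max(
                R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in actions)
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new

def greedy_policy(states, actions, P, R, V, gamma=0.9):
    """Extract the greedy policy of equation (2.5) from a value function V."""
    return {s: max(actions,
                   key=lambda a: R[(s, a)] + gamma * sum(
                       p * V[s2] for s2, p in P[(s, a)].items()))
            for s in states}
```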

2.1.4 Policy Iteration

While value iteration updates the value and the policy at each step, policy iteration finds a policy and tries to improve it until the policy cannot be improved. Figure 2.2 describes pseudo-code by which policy iteration can be implemented [14].

procedure PolicyIteration
    π_0 ← an arbitrary policy
    j ← 0
    continue ← true
    while (continue)
        compute V^{π_j}
        π_{j+1} ← argmax_a(V^{π_j})
        if (π_j = π_{j+1}) then
            return π_j
            continue ← false
        else
            j ← j + 1
    end

Figure 2.2: Policy iteration
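As a concrete illustration of Figure 2.2, here is a minimal policy-iteration sketch in Python. The dictionary-based layout for P and R, and the use of iterative sweeps for policy evaluation instead of a linear solve, are assumptions of this sketch rather than details from the thesis.

```python
# Hypothetical policy-iteration sketch following Figure 2.2.
def policy_iteration(states, actions, P, R, gamma=0.9, eval_tol=1e-6):
    pi = {s: actions[0] for s in states}  # arbitrary initial policy
    while True:
        # Policy evaluation: approximate V^pi by repeated Bellman backups.
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                v = R[(s, pi[s])] + gamma * sum(
                    p * V[s2] for s2, p in P[(s, pi[s])].items())
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < eval_tol:
                break
        # Policy improvement: act greedily with respect to V^pi.
        pi_new = {
            s: max(actions, key=lambda a: R[(s, a)] + gamma * sum(
                p * V[s2] for s2, p in P[(s, a)].items()))
            for s in states}
        if pi_new == pi:
            return pi, V
        pi = pi_new
```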

2.2 State Abstraction

State abstraction or state aggregation is a group of methods for classifying the states into groups and representing them with a hierarchy. Under certain circumstances, state abstraction methods can be used to construct more compact MDPs that permit faster learning of optimal or approximate solutions. To do this, these methods use the basic components of a MDP, such as the transition probabilities and the reward function, to represent a large class of states with a single state of the abstract space. The most important requirements for the generated abstraction to be a valid approximate MDP are:

1. The difference in the transition function and the reward function between the two models has to be small.

2. For each policy on the original state space there must exist a policy in the abstract model. And if a state s' is not reachable from state s in the abstract model, then there should not exist a policy that leads from s to s' in the original state space.

CHAPTER III

PREVIOUS WORK

3.1 SMDPs

One of the approaches to treating temporal abstraction is to use the theory of semi-Markov decision processes (SMDPs). The actions in SMDPs take a variable amount of time and are intended to model temporally extended actions, represented as sequences of primary actions. Figure 3.1 shows a SMDP that is derived from a MDP [18]. In this figure the top panel shows the state trajectory over discrete time in the MDP, and the lower panel shows the larger state changes in a SMDP. The filled circles indicate decision points where a new action has to be selected, while the empty circles in the SMDP represent states in which the previously selected multi-step action is still active. As can be seen, a smaller number of decisions have to be made in the SMDP, which should accelerate learning.

Figure 3.1: Comparison between MDP and SMDP

3.1.1 Policies

A policy (option) in SMDPs is a triple o_i = (I_i, π_i, β_i) [20], where I_i is an initiation set, π_i : S × A → [0,1] is a primary policy and β_i : S → [0,1] is a termination condition. When a policy is executed, actions are chosen according to π_i until the policy terminates stochastically according to β_i. The initiation set and termination condition of a policy limit the range of states over which the policy needs to be defined and determine its termination. Given a set of policies, their initiation sets thus define a subset Π_s of available policies for each s ∈ S. As single-step actions are policies as well, A_s ⊆ Π_s.

The elements of A_s are called primary actions and the elements of Π_s are called multi-step actions. Given any set of multi-step actions, we consider policies over those actions. In this case we need to generalize the definition of the value function. The value of a state s ∈ S under an SMDP policy π_o is defined as:
$$V^{\pi_o}(s) = E[\, r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \ldots \mid \varepsilon(\pi_o, s, t)\,] \qquad (3.1)$$
where ε(π_o, s, t) denotes the event of an action under π_o being initiated in state s at time t, and r_t denotes the reward at time t. For any multi-step action o_i, the reward in state s is computed as:
$$R(s, o_i) = E[\, r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \ldots + \gamma^{k-1} r_{t+k}\,] \qquad (3.2)$$
where t + k is the time at which action o_i terminates. The transition value for state s is defined by:
$$F(s' \mid s, o_i) = \sum_{k=1}^{\infty} P(s_{t+k} = s',\, k \mid s_t = s,\, o_i)\, \gamma^{k} \qquad (3.3)$$
Using Bellman's equation we can compute the general policy. For any Markov policy π_o, the state value function can be written as:
$$V^{\pi_o}(s) = E[\, r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \ldots + \gamma^{k-1} r_{t+k} + \gamma^{k} V^{\pi_o}(s_{t+k}) \mid \varepsilon(\pi_o, s, t)\,] \qquad (3.4)$$

where k is the duration of the first multi-step action selected by π_o, and the Bellman equation for (3.4) is
$$V^{\pi_o}(s) = R(s, o) + \sum_{s'} F(s' \mid s, o)\, V^{\pi_o}(s') \qquad (3.5)$$
where o = π_o(s). The optimal value function is [18]:
$$V^{*}(s) = \max_{o \in O_s} \left[ R(s, o) + \sum_{s'} F(s' \mid s, o)\, V^{*}(s') \right] \qquad (3.6)$$

3.1.2 Example of a SMDP

Consider the rooms problem, a grid world environment with five rooms and one hallway illustrated in Figure 3.2. The cells of the grid are the states of the environment. From any state the robot can perform one of the four actions up, down, left or right, which have a stochastic effect and fail 50% of the time, i.e. with probability 1/2 they are successful, and the agent moves in one of the other three directions with probability 1/6 each. The reward for each state is zero.

Figure 3.2: The rooms example as a grid world environment

There is a hallway in this environment, designed to let the agent reach the other rooms and the elevator. For each room there are policies π_i which move the robot along the shortest path to the hallway or the other rooms.

Figure 3.3: The policy of one of the four rooms

For example, the policy for one room is shown in Figure 3.3. The termination condition for these policies is zero for states within the room and 1 for states outside the room. The initiation set I consists of the states in the room. As shown in Figure 3.2, there is more than one policy that leads from each state to other rooms or to the hallway. The arrows in Figure 3.2 illustrate the results of the different multi-step actions. When a multi-step action is executed within a room, it will end in a state outside the room.

3.2 State Abstraction in MDPs

In the previous chapter the concepts of MDPs and their related algorithms have been introduced. While representing a planning problem in stochastic domains with MDPs is

suitable, the complexity of the algorithms increases rapidly with the size of the state space. It is often possible to represent a MDP by an approximate MDP with a smaller set of states and almost equivalent state transition and reward functions. This will generally not guarantee the existence of an optimal solution, but should accelerate the process of finding an approximate solution. The result of the factorization can be described as a partition of the state space where states in the same block have the same transition probability to other blocks. The basic idea of these methods has its origin in automata theory and stochastic processes. This section introduces a framework built around the concept of ε-homogenous partitions [3], in which states in the same block may have transitions to states in other blocks as long as the difference in their transition probabilities is smaller than ε. Any ε-homogenous partition results in a MDP with a state space comprised of the blocks of the partition, and transitions from each block to any other block. In order to define these transitions, a bounded parameter MDP (BPMDP) is defined. A BPMDP is a MDP in which the transition probabilities and rewards are given as intervals instead of single values. Bounded parameter MDPs are used in state aggregation for solving large MDPs, and they can be used to find an approximate solution. The remainder of this chapter

discusses the following topics: the formal definition of BPMDPs, ε-homogenous partitions, and the methods for finding an optimal policy.

3.3 Bounded Parameter MDPs

A bounded parameter MDP is a four-tuple M̂ = (Ŝ, Â, P̂, R̂) where Ŝ and Â are defined as for MDPs, and P̂, R̂ are analogous to P and R in MDPs but assign closed intervals rather than single values to each state-action pair. That is, for any action a and states s, s' ∈ S, the values of R̂(s, a) and P̂(s'|s, a) are both closed intervals [l, u] for real numbers l, u with l ≤ u, and in the case of P̂ we require 0 ≤ l ≤ u ≤ 1 [2]. To ensure that P̂ is well-defined we require that for any action a and state s, the sum of the lower bounds of P̂(s'|s, a) over all states s' must be less than or equal to 1, while the upper bounds must sum to a value greater than or equal to 1. Figure 3.4 illustrates the state-transition diagram for a simple BPMDP with three states and one action.

Figure 3.4: The state transition diagram for a BPMDP

An interval value function V̂ is a map from states to closed intervals. A BPMDP M̂ = (Ŝ, Â, P̂, R̂) induces an exact MDP M = (S, A, P, R) where S = Ŝ and A = Â, and for any action a and states s, s' ∈ S, P(s'|s, a) and R(s, a) lie in the ranges of P̂(s'|s, a) and R̂(s, a) respectively. In a BPMDP M̂, the interval value V̂^π(s) for state s is defined by the interval:
$$\left[\, \min_{P, R \in \hat{P}, \hat{R}} V^{\pi}(s),\ \max_{P, R \in \hat{P}, \hat{R}} V^{\pi}(s) \,\right] \qquad (3.7)$$

3.4 ε reduction Method

Dean et al. [4] introduced a family of algorithms that take a MDP and a real value 0 ≤ ε ≤ 1 as input and compute a bounded parameter MDP in which each closed interval has a width of less than ε. The states in this MDP correspond to blocks of a partition of the state space in which the states in the same block have the same properties in terms of transitions and rewards. Let P = {B_1, ..., B_n} be a partition of the state space [4].

Definition 3.1: A partition P = {B_1, ..., B_n} of the state space of a MDP M has the property of ε-approximate stochastic bisimulation homogeneity with respect to M for 0 ≤ ε ≤ 1 if and only if for each B_i, B_j ∈ P, for each a ∈ A and for each s, s' ∈ B_i:
$$|R(s, a) - R(s', a)| \le \varepsilon \qquad \forall a \in A \qquad (3.8)$$
and
$$\left| \sum_{s'' \in B_j} P(s'' \mid s, a) - \sum_{s'' \in B_j} P(s'' \mid s', a) \right| \le \varepsilon.$$

Definition 3.2: A partition P' is a refinement of a partition P if and only if each block of P' is a subset of

some block of P. In this case we say that P is coarser than P'.

Definition 3.3: The immediate reward partition is the partition in which two states s, s' ∈ S are in the same block if they have the same rewards.

Definition 3.4: The block B_i of a partition P is ε-stable [4] with respect to block B_j if and only if for all actions a ∈ A and all states s, s' ∈ B_i:
$$\left| \sum_{s'' \in B_j} P(s'' \mid s, a) - \sum_{s'' \in B_j} P(s'' \mid s', a) \right| \le \varepsilon.$$

The ε model reduction algorithm first uses the immediate reward partition as an initial partition and checks the ε-stability of each block of this partition until there are no unstable blocks left. For example, when block B_i happens to be unstable with respect to block B_j, the block B_i is replaced by a set of sub-blocks B_{i_1}, ..., B_{i_k} such that each B_{i_m} is a maximal sub-block of B_i that is ε-stable with respect to B_j.
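To make Definition 3.4 and the splitting step concrete, the following is a minimal sketch in Python. The dictionary layout for P and the grouping of states by binned block-transition probabilities are assumptions of this illustration; in particular, the binning is only one simple way to form ε-stable sub-blocks and is not the exact maximal-sub-block procedure of [4].

```python
# Hypothetical helpers for epsilon-stability checking and block splitting.
# P[(s, a)] maps successor states to probabilities P(s'|s, a).
def block_prob(s, a, block, P):
    """Probability of moving from s into `block` in one step under action a."""
    return sum(p for s2, p in P[(s, a)].items() if s2 in block)

def is_eps_stable(B_i, B_j, actions, P, eps):
    """Definition 3.4: all states of B_i agree (within eps) on transitions into B_j."""
    return all(abs(block_prob(s, a, B_j, P) - block_prob(t, a, B_j, P)) <= eps
               for a in actions for s in B_i for t in B_i)

def split_block(B_i, B_j, actions, P, eps):
    """Split B_i into sub-blocks whose states fall into the same eps-wide bins
    of block-transition probability into B_j, for every action."""
    groups = {}
    for s in B_i:
        key = tuple(round(block_prob(s, a, B_j, P) / max(eps, 1e-12))
                    for a in actions)
        groups.setdefault(key, set()).add(s)
    return list(groups.values())
```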

Theorem 3.1: For ε > 0, the partition P found by the ε-reduction model algorithm from the MDP M is coarser than, and thus no larger than, the state space of M [4].

Once the ε-stable blocks of the partition have been constructed, the transition and reward functions between blocks can be defined. The transition interval of each block is, by definition, bounded by the minimum and maximum probabilities of all possible transitions from the states of one block to the states of another block:
$$\hat{P}(B_j \mid B_i, a) = \left[\, \min_{s \in B_i} \sum_{s' \in B_j} P(s' \mid s, a),\ \max_{s \in B_i} \sum_{s' \in B_j} P(s' \mid s, a) \,\right] \qquad (3.9)$$
Similarly, the reward for a block B_j is:
$$\hat{R}(B_j, a) = \left[\, \min_{s \in B_j} R(s, a),\ \max_{s \in B_j} R(s, a) \,\right] \qquad (3.10)$$
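A small sketch of how the interval bounds in equations (3.9) and (3.10) could be computed for a block-level BPMDP, again assuming the dictionary layout used in the earlier sketches (the layout and function names are illustrative, not from the thesis):

```python
# Hypothetical computation of the interval transition and reward bounds.
def interval_transition(B_i, B_j, a, P):
    """Equation (3.9): [min, max] over states in B_i of the one-step
    probability of landing anywhere in B_j under action a."""
    probs = [sum(p for s2, p in P[(s, a)].items() if s2 in B_j) for s in B_i]
    return (min(probs), max(probs))

def interval_reward(B_j, a, R):
    """Equation (3.10): [min, max] immediate reward over states in B_j."""
    rewards = [R[(s, a)] for s in B_j]
    return (min(rewards), max(rewards))
```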

CHAPTER IV

EXTENSION TO ε REDUCTION METHOD

While Dean's ε reduction technique permits the derivation of appropriate state abstractions in the form of state space partitions, it poses several problems when applied in practice. First, it relies heavily on complete knowledge of the reward structure of the problem. The goal of the original technique is to obtain policies with similar utility. In many practical problems, however, it is more important to achieve the task goal than it is to do so in the optimal way. In other words, correctly representing the connectivity and ensuring the achievability of the task objective is often more important than precision in the value function. To reflect this, the reduction technique can easily be extended to include separate thresholds ε and δ for the reward function and the transition probabilities, respectively. This makes it more flexible and permits emphasizing task achievement over the utility of the learned policy. The second important step is the capability of including the state abstraction technique in a hierarchical

learning scheme. This implies that it should be able to efficiently deal with increasing action spaces that over time include more temporally extended actions in the form of learned policies. To address this, the abstraction method should change the representation as such hierarchical changes are made. This chapter presents extensions to the reduction technique that permit the use of policies as actions within the SMDP framework and that allow for the efficient construction of final partitions for varying action sets. To achieve this while still guaranteeing similar bounds on the quality of a policy learned on the reduced state space, the basic technique has to be extended to account for actions that perform multiple transitions on the underlying state space. The final part of this chapter discusses state space reduction when the reward function is not available. In these situations, refinement can be done using the transition probabilities. This method also shows that when it is necessary to run different tasks in the same environment, refinement by transition probabilities has to be performed only for the first task and can subsequently be augmented by task-specific reward refinement. In this way the presented methods can further reduce the time complexity in situations where multiple tasks have to be learned in the same environment.

4.1 ε,δ Reduction for SMDPs

For a given MDP we construct the policies o_i = (I_i, π_i, β_i) by defining sub-goals and finding the policies π_i that lead to the sub-goals from each state s ∈ S. The transition probability function F(s'|s, o_i) and the reward function R(s, o_i) for this state and policy can be computed with equations 3.2 and 3.3. Discount and probabilities are folded here into a single value. As can be seen, calculation of this model is significantly more complex than in the case of single-step actions. However, the transition probability is a pure function of the option and can thus be completely precomputed at the time at which the policy itself is learned. As a result, only the discounted reward estimate has to be re-computed for each new learning task. The transition and reward criteria for constructing a partition over policies are, for all states s, s' in the same block:
$$|R(s, o_i) - R(s', o_i)| \le \varepsilon \qquad \forall o_i \in O$$
and
$$\left| \sum_{s'' \in B_j} F(s'' \mid s, o_i) - \sum_{s'' \in B_j} F(s'' \mid s', o_i) \right| \le \delta \qquad (4.1)$$
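The following sketch illustrates one way the option model of equations (3.2) and (3.3) could be precomputed by iterating their fixed-point form (folding the discount into the transition model, as described above), together with a check of the ε,δ criterion in equation (4.1). The dictionary-based layout, the fixed number of sweeps, and all function names are assumptions of this illustration, not the thesis' implementation.

```python
# Hypothetical precomputation of an option model (R_o, F_o) and an
# epsilon,delta comparison of two states under that model.
def option_model(states, P, R, pi_o, beta, gamma=0.9, sweeps=200):
    """pi_o[s] is the option's primitive action in s; beta[s] is its
    termination probability in s. Iterates the fixed-point equations
    R(s,o) = R(s,a) + gamma * sum_s' P(s'|s,a)(1-beta(s'))R(s',o) and
    F(x|s,o) = gamma * sum_s' P(s'|s,a)[beta(s')1{s'=x} + (1-beta(s'))F(x|s',o)]."""
    R_o = {s: 0.0 for s in states}
    F_o = {(s, x): 0.0 for s in states for x in states}
    for _ in range(sweeps):
        for s in states:
            a = pi_o[s]
            succ = P[(s, a)]
            R_o[s] = R[(s, a)] + gamma * sum(
                p * (1.0 - beta[s1]) * R_o[s1] for s1, p in succ.items())
            for x in states:
                F_o[(s, x)] = gamma * sum(
                    p * (beta[s1] * (1.0 if s1 == x else 0.0)
                         + (1.0 - beta[s1]) * F_o[(s1, x)])
                    for s1, p in succ.items())
    return R_o, F_o

def within_eps_delta(s1, s2, option_models, blocks, eps, delta):
    """Equation (4.1): s1 and s2 may share a block only if their option
    rewards differ by at most eps and their discounted block-transition
    values F differ by at most delta, for every option and every block."""
    for R_o, F_o in option_models:
        if abs(R_o[s1] - R_o[s2]) > eps:
            return False
        for B_j in blocks:
            if abs(sum(F_o[(s1, x)] for x in B_j)
                   - sum(F_o[(s2, x)] for x in B_j)) > delta:
                return False
    return True
```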

4.2 Action Dependent Decomposition of ε,δ Reduction

The intuition here is that complex policies will involve fewer decision points and, as a result, can be learned substantially faster. In hierarchical learning systems it is thus useful to remove primitive actions from consideration as more complex actions become available and capable of addressing new problems. On the other hand, such a limitation of the action space should also be reflected in the state representation. In particular, once low-level actions are ignored, much more abstract state representations should be sufficient to address the same tasks. However, it is generally not known beforehand at which point lower-level actions can be safely ignored without compromising the set of tasks that can be addressed. The state reduction technique therefore has to be flexible and able to adjust to changes in the action space efficiently without incurring the overhead of completely re-computing a partition. To permit such flexibility, the approach presented here derives partitions on a per-action basis and provides an efficient approach to construct overall partitions from these. In particular, it derives an ε,δ partition P_i for each action o_i.

Let M be a SMDP with n different actions o_1, ..., o_n and let P_1, ..., P_n be the ε,δ partitions corresponding to each action, where P_i = {B^i_1, ..., B^i_{m_i}} for i ∈ W = {i | o_i ∈ O}. Define Φ = P_1 × P_2 × ... × P_n, the cross product of all partitions. Each element of Φ has the form φ_j = (B^1_{σ_1(j)}, ..., B^n_{σ_n(j)}) where σ_i is a function with domain Φ and range {1, ..., m_i}. Each element φ_j ∈ Φ corresponds to a block B̃_j = ∩_i B^i_{σ_i(j)}. Since B^i_k ∩ B^i_l = ∅ for all 1 ≤ k, l ≤ m_i with k ≠ l, {B̃_j} is a partition over all actions. Given a particular subset of actions, a partition for the learning task can now be derived as the set of all nonempty blocks resulting from the intersection of the blocks for the participating actions. A block in the resulting partition can therefore be represented by a vector over the actions involved, where each entry indicates the index of the block within the corresponding single-action partition. Once the initial blocks are constructed by the above algorithm, these blocks are refined until they are ε-stable according to Dean's method. Changes in the action set therefore do not require a recalculation of the individual partitions but only changes in the length of the vectors representing the new states and a recalculation of the final refinement step. This means that changes in the action set can be performed efficiently, and a simple mechanism can be provided to use the previously learned value function even beyond the change of actions and to use it as a starting point for subsequent additional learning. This is particularly important if actions are added over time to permit refinement of the initially learned policy through finer-grained decisions.
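A minimal sketch of this composition step: each state is labeled with the index of its block in every participating per-action partition, and states that share a label vector form one block of the combined partition. The function and the small example below are purely illustrative.

```python
# Hypothetical composition of per-action partitions by intersection.
def intersect_partitions(states, partitions):
    """partitions: one partition per participating action,
    each given as a list of sets of states."""
    combined = {}
    for s in states:
        # Vector of block indices, one entry per per-action partition.
        label = tuple(next(i for i, block in enumerate(part) if s in block)
                      for part in partitions)
        combined.setdefault(label, set()).add(s)
    return list(combined.values())

# Example: two partitions of {0, 1, 2, 3} compose into three blocks.
blocks = intersect_partitions(
    [0, 1, 2, 3],
    [[{0, 1}, {2, 3}], [{0}, {1, 2, 3}]])
# blocks == [{0}, {1}, {2, 3}]
```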

4.3 A Simple Example

In this example we assume a grid world with a mobile robot which can perform four primitive deterministic actions: left, right, up and down. The reward for actions that lead the agent to another cell is assumed to be -1. In order to construct an option we define a policy for each action. The termination condition is hitting the wall, and the policy repeats its action until it terminates. Figure 4.1 shows this scenario.

Figure 4.1: Grid world for the example

Figures 4.2 through 4.5 show the possible partitions for the four options. Let B^i_j be block j of partition i derived from action o_i. Then the cross product Φ of these blocks contains all possible combinations of these blocks:

Φ = {B^1_1, B^1_2} × {B^2_1, B^2_2} × {B^3_1, B^3_2} × {B^4_1, B^4_2}

Figure 4.2: Partition for option left

Figure 4.3: Partition for option up

Figure 4.4: Partition for option right

Figure 4.5: Partition for option down

Figure 4.6: Intersection of partitions

The intersection partition has the elements:

B^{1,2,3,4}_1 = B^1_1 ∩ B^4_1 ∩ B^2_1 ∩ B^3_1
B^{1,2,3,4}_2 = B^1_2 ∩ B^4_1 ∩ B^2_1 ∩ B^3_1
⋮

Figure 4.6 illustrates the intersection of the partitions. These blocks form the initial blocks for the ε reduction technique. The result of the refinement is illustrated in Figure 4.7. Since performing an action in a state now leads to another block rather than to a single state, each block of Figure 4.7 can be considered a single state in the resulting BPMDP.

Figure 4.7: Final blocks of partition

4.4 Reward Independent Partitions

Real environments usually do not provide all the necessary information to an agent, and the agent needs to find out these details by itself. For example, it is common that an agent does not have full information about the reward structure. In these situations, constructing the immediate reward partition is not possible and the partitions for ε,δ reduction have to be determined differently. For this purpose an algorithm is introduced which derives partitions in two different phases. This reward independent partition method constructs the initial blocks by distinguishing the terminal states of the available actions from non-terminal states and refines them using the transition probabilities. If the reward structure is available, this method further refines the initial partitions using the reward and transition criteria. The advantage of this construction is that the learning process can be done on-line while performing the reward abstraction. Also, whenever there is a change in the reward criteria, only the final refinement part has to be recomputed.

Definition 4.1: A subset C of the state space S is called a terminal set under action a if P(s'|s, a) = 0 for all s ∈ C and s' ∉ C.
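A minimal check of Definition 4.1 in code, assuming the dictionary layout used in the earlier sketches (the layout and the function name are illustrative assumptions):

```python
# Hypothetical test of Definition 4.1: C is terminal under action a if no
# probability mass leaves C from any state in C.
def is_terminal_set(C, a, P):
    return all(s_next in C or p == 0.0
               for s in C
               for s_next, p in P[(s, a)].items())
```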

Definition 4.2: Let P^(n)(s'|s, a) denote the probability of first visiting state s' from state s after n steps. That is,
$$P^{(n)}(s' \mid s, a) = P(s_{k+n} = s',\, s_{k+n-1} \ne s',\, s_{k+n-2} \ne s', \ldots, s_{k+1} \ne s' \mid s_k = s,\, a)$$

Definition 4.3: For fixed states s and s', let F*(s'|s, a) = Σ_{n=1}^∞ P^(n)(s'|s, a). The symbol F*(s'|s, a) is the probability of ever visiting state s' from state s.

Proposition 4.1: A state s belongs to a terminal set with respect to action a if F*(s|s, a) = 1.

Proposition 4.1 gives a direct way to find the terminal sets, i.e. the termination conditions for each action. Once the terminal sets are constructed, the state space can be partitioned by transition probabilities using equation 4.1. In situations where the reward information is not available, the reward independent method can be used to learn a policy without the need to determine the complete reward structure first.

Proposition 4.2: For any policy π for which the goal G can be represented as a conjunction of terminal sets (sub-goals) of the available actions in the original MDP M, there is a policy π_Φ in the reduced MDP M_Φ that achieves G, provided that for each state in M from which there exists a path to G, there exists such a path for which P(s_{t+1}|s_t, π(s_t)) > δ at every step.

Proof: The blocks of the partition Φ = {B_1, ..., B_n} have the following property: for all s_1, s_2 ∈ B_i,
$$\left| \sum_{s'' \in B_j} F(s'' \mid s_1, o_i) - \sum_{s'' \in B_j} F(s'' \mid s_2, o_i) \right| \le \delta \qquad (4.2)$$
Let Π be the set of all policies in the stochastic domain S. For every policy π that fulfills the requirements of the proposition, there exists a policy π_Φ in the partition space such that for each n ∈ ℕ, if there is a path of length n from state s_0 to a goal state G under policy π, then there is a path from the block B_{s_0} containing s_0 to the block B_G containing G under policy π_Φ.

Case k = 1: If F(G|s_0, π(s_0)) > δ then by condition (4.2), for all s ∈ B_{s_0},
$$\left| \sum_{s' \in B_G} F(s' \mid s, \pi(s_0)) - \sum_{s' \in B_G} F(s' \mid s_0, \pi(s_0)) \right| \le \delta,$$
and thus F(G|s, π(s_0)) ≥ F(G|s_0, π(s_0)) − δ > 0. Define the policy π_Φ such that π_Φ(B_{s_0}) = π(s_0); then F(B_G|B_{s_0}, π_Φ(B_{s_0})) > 0.

Case k = n − 1: Assume that for each path of length less than or equal to n − 1 that reaches state G from s_0 under policy π, there is a corresponding path under policy π_Φ in the partition space.

Case k = n: Each path that reaches G from s_0 under policy π in n steps contains a path of n − 1 steps that reaches G from s_1 under policy π. By the induction hypothesis, there is a policy π_Φ that leads to B_G from B_{s_1}. Now if s_0 is an element of B_{s_1} ∪ ... ∪ B_{s_{n-1}}, the blocks already chosen by paths of length less than or equal to n − 1, then there is a policy that leads to B_G from B_{s_0} under π_Φ, and the value π_Φ(B_{s_0}) is already defined. But if s_0 ∉ B_{s_1} ∪ ... ∪ B_{s_{n-1}}, then by the induction hypothesis it only has to be shown that there is a policy π_Φ that fulfills the induction hypothesis and which leads from B_{s_0} to B_{s_1} such that F(B_{s_1}|B_{s_0}, π_Φ(B_{s_0})) > 0. By condition (4.2), for all s ∈ B_{s_0},
$$\left| \sum_{s' \in B_{s_1}} F(s' \mid s, \pi(s_0)) - \sum_{s' \in B_{s_1}} F(s' \mid s_0, \pi(s_0)) \right| \le \delta,$$
and therefore
$$F(B_{s_1} \mid B_{s_0}, \pi_{\Phi}(B_{s_0})) = \sum_{s' \in B_{s_1}} F(s' \mid s, \pi(s_0)) \ge \sum_{s' \in B_{s_1}} F(s' \mid s_0, \pi(s_0)) - \delta > 0.$$

Therefore the proposition holds, and thus for every policy with a goal as defined in the proposition there is a policy that achieves the goal in the reward independent partition.

4.5 Experimental Results

Experiment 1: In order to compare the model minimization methods that are introduced in this thesis, several state spaces of different sizes have been examined with this algorithm. These examples follow the pattern illustrated in Figure 4.8. The underlying actions are left, right, up and down; they are successful with probability 0.9 and return to the same state with probability 0.1. In this experiment the primitive actions are extended by repeating them until they hit a wall. Figure 4.9 illustrates the final blocks of the partition in this experiment, and Figures 4.10 and 4.11 illustrate the run times in different phases of the algorithms.

Figure 4.8: The pattern for the experiments

Figure 4.9: The final blocks of the partition for experiment 1

Figure 4.10: Run time for Dean's method with multi-step actions

Figure 4.11: Run time for the reward independent method

Figure 4.12: Number of iterations for learning a policy (value of policy vs. number of iterations, for the original state space and the partition space)

Figure 4.12 illustrates the learning curve for a single state using value iteration in the original state space and in the partition space. The reward and transition probabilities are discounted with γ = 0.9 for the multi-step actions. These graphs suggest that the value of the learned policy in partition space is smaller. However, the time spent on learning in the partition space is smaller than the time spent on learning in the original space.

In Figure 4.13 the numbers of partitions produced by Dean's method and by the two-stage method are compared. In order to construct state spaces with larger numbers of states, we replicate the pattern in Figure 4.8 several times and rerun the experiment.

Figure 4.13: Comparison of number of blocks (number of blocks of the partition vs. number of states, for the original space, the partition with Dean's algorithm, and the partition with transition probabilities)

Experiment 2: This experiment has been performed in order to compare the total partitioning time after the first trial. In order to

do so, the goal state has been placed in different positions and the run times of Dean's algorithm and the two-stage algorithm are investigated. Figure 4.14 shows an environment for this experiment:

Figure 4.14: Environment for comparing run times of the two algorithms over different trials

This environment consists of three grid worlds, each of which consists of different rooms. The termination states of the available actions are illustrated in black in Figure 4.14.

The actions for each state are multi-step actions. These actions terminate when they reach a sub-goal in the same room. This experiment shows that, even though the run time of Dean's algorithm is lower than the run time of the reward independent algorithm for the first trial, after the first trial the reward independent algorithm is much faster, as the transition partitioning for this algorithm is already done in the first trial and is not necessary after it has been performed once. Figures 4.15 and 4.16 show the difference in run times for 6 different trials. For the first trial the situation is similar to the previous experiment, and the total run time of Dean's method is smaller than the total run time of the two-stage method. When a different goal is placed in the environment, the reward independent method does not need to refine the state space with the transition probabilities again, as this has been done for the first task. On the other hand, Dean's method needs to redefine the initial blocks for each task. As a result, the total run time after the first task in the reward independent method is significantly smaller than the run time of Dean's method. This experiment also shows that policies can be learned in the absence of prior knowledge of the reward.

Figure 4.15: Run time of Dean's algorithm for 6 trials

Figure 4.16: Run time of the reward independent algorithm for 6 trials

Learning experiments are performed with a reward of 100 at the goal and an intermediate reward of 30 for approaching the goal from a particular direction. The learning curves for one state, using value iteration with the transition probability partition and the intermediate reward, are shown in Figure 4.17.

Figure 4.17: Number of iterations for learning a policy

4.6 Conclusion

Markov decision processes are a useful way to model a stochastic environment, as there are well-established algorithms to solve this type of model. However, these algorithms have a high time complexity when the state space is large. One way to address this problem is to use ε reduction to group all states that have a small difference in

transition probability and reward function and to consider them as a single state. This method falls into the category called state aggregation. ε reduction relies on the reward structure; however, there are situations in which the reward information is not known. Furthermore, ε reduction does not have the capability of using temporally extended actions by executing a multi-step action instead of a single-step action. Temporally extended actions reduce the number of decision points where a new action has to be selected. As a smaller number of decisions has to be made, the learning process should accelerate. To address this and to further expand the capabilities of ε reduction, this thesis improves this reduction technique with the SMDP framework and creates a two-phase method that can be used in the absence of reward information. In environments where the reward structure is not available, the initial blocks of the partition are the blocks that contain the terminating states of each available multi-step action. These blocks can be refined according to the transition probabilities in order to define a valid BPMDP. However, when the reward information is available, these blocks can be refined according to the transition probabilities and the reward function in order to find a better approximate solution.

The improvement to ε reduction through temporally extended actions and refinement without reward information is particularly advantageous in situations where multiple tasks have to be learned in the same environment. Once the transition refinement has been performed on the state space, this refinement is not necessary for other tasks, and future tasks can be learned in the same reduced state space.

REFERENCES

[1] Darwiche, A. and Goldszmidt, M. Action networks: A framework for reasoning about actions and change under uncertainty. UAI-94, Seattle.

[2] Dean, T., Kaelbling, L. P., Kirman, J., and Nicholson, A. Planning with deadlines in stochastic domains. AAAI-93, Washington, D.C.

[3] Dean, T. and Kanazawa, K. A model for reasoning about persistence and causation. Computational Intelligence, 5(3).

[4] Dean, T. and Wellman, M. Planning and Control. Morgan Kaufmann, San Mateo.

[5] Dearden, R. and Boutilier, C. Integrating planning and execution in stochastic domains. UAI-94, Seattle. Howard, R. A. Dynamic Probabilistic Systems. Wiley.

[6] Dietterich, T. G. (2000). Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition. Journal of Artificial Intelligence Research, 13.

[7] Kushmerick, N., Hanks, S., and Weld, D. An algorithm for probabilistic least-commitment planning. AAAI-94, Seattle.

[8] Mahadevan, S. (1996). Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results. Machine Learning, 22.

[9] Mahadevan, S., Marchalleck, N., Das, T., & Gosavi, A. (1997). Self-Improving Factory Simulation using Continuous-Time Average Reward Reinforcement Learning. Proceedings of the Fourteenth International Conference on Machine Learning.


Reinforcement Learning and Control CS9 Lecture notes Andrew Ng Part XIII Reinforcement Learning and Control We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make

More information

Planning Under Uncertainty: Structural Assumptions and Computational Leverage

Planning Under Uncertainty: Structural Assumptions and Computational Leverage Planning Under Uncertainty: Structural Assumptions and Computational Leverage Craig Boutilier Dept. of Comp. Science Univ. of British Columbia Vancouver, BC V6T 1Z4 Tel. (604) 822-4632 Fax. (604) 822-5485

More information

16.4 Multiattribute Utility Functions

16.4 Multiattribute Utility Functions 285 Normalized utilities The scale of utilities reaches from the best possible prize u to the worst possible catastrophe u Normalized utilities use a scale with u = 0 and u = 1 Utilities of intermediate

More information

Temporal Difference Learning & Policy Iteration

Temporal Difference Learning & Policy Iteration Temporal Difference Learning & Policy Iteration Advanced Topics in Reinforcement Learning Seminar WS 15/16 ±0 ±0 +1 by Tobias Joppen 03.11.2015 Fachbereich Informatik Knowledge Engineering Group Prof.

More information

Introduction to Reinforcement Learning. CMPT 882 Mar. 18

Introduction to Reinforcement Learning. CMPT 882 Mar. 18 Introduction to Reinforcement Learning CMPT 882 Mar. 18 Outline for the week Basic ideas in RL Value functions and value iteration Policy evaluation and policy improvement Model-free RL Monte-Carlo and

More information

CS 7180: Behavioral Modeling and Decisionmaking

CS 7180: Behavioral Modeling and Decisionmaking CS 7180: Behavioral Modeling and Decisionmaking in AI Markov Decision Processes for Complex Decisionmaking Prof. Amy Sliva October 17, 2012 Decisions are nondeterministic In many situations, behavior and

More information

Solving Stochastic Planning Problems With Large State and Action Spaces

Solving Stochastic Planning Problems With Large State and Action Spaces Solving Stochastic Planning Problems With Large State and Action Spaces Thomas Dean, Robert Givan, and Kee-Eung Kim Thomas Dean and Kee-Eung Kim Robert Givan Department of Computer Science Department of

More information

Predictive Timing Models

Predictive Timing Models Predictive Timing Models Pierre-Luc Bacon McGill University pbacon@cs.mcgill.ca Borja Balle McGill University bballe@cs.mcgill.ca Doina Precup McGill University dprecup@cs.mcgill.ca Abstract We consider

More information

Hidden Markov Models (HMM) and Support Vector Machine (SVM)

Hidden Markov Models (HMM) and Support Vector Machine (SVM) Hidden Markov Models (HMM) and Support Vector Machine (SVM) Professor Joongheon Kim School of Computer Science and Engineering, Chung-Ang University, Seoul, Republic of Korea 1 Hidden Markov Models (HMM)

More information

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam:

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam: Marks } Assignment 1: Should be out this weekend } All are marked, I m trying to tally them and perhaps add bonus points } Mid-term: Before the last lecture } Mid-term deferred exam: } This Saturday, 9am-10.30am,

More information

Chapter 3: The Reinforcement Learning Problem

Chapter 3: The Reinforcement Learning Problem Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information

CSE250A Fall 12: Discussion Week 9

CSE250A Fall 12: Discussion Week 9 CSE250A Fall 12: Discussion Week 9 Aditya Menon (akmenon@ucsd.edu) December 4, 2012 1 Schedule for today Recap of Markov Decision Processes. Examples: slot machines and maze traversal. Planning and learning.

More information

Bias-Variance Error Bounds for Temporal Difference Updates

Bias-Variance Error Bounds for Temporal Difference Updates Bias-Variance Bounds for Temporal Difference Updates Michael Kearns AT&T Labs mkearns@research.att.com Satinder Singh AT&T Labs baveja@research.att.com Abstract We give the first rigorous upper bounds

More information

Learning in Zero-Sum Team Markov Games using Factored Value Functions

Learning in Zero-Sum Team Markov Games using Factored Value Functions Learning in Zero-Sum Team Markov Games using Factored Value Functions Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 27708 mgl@cs.duke.edu Ronald Parr Department of Computer

More information

Introduction to Spring 2006 Artificial Intelligence Practice Final

Introduction to Spring 2006 Artificial Intelligence Practice Final NAME: SID#: Login: Sec: 1 CS 188 Introduction to Spring 2006 Artificial Intelligence Practice Final You have 180 minutes. The exam is open-book, open-notes, no electronics other than basic calculators.

More information

Decision Theory: Q-Learning

Decision Theory: Q-Learning Decision Theory: Q-Learning CPSC 322 Decision Theory 5 Textbook 12.5 Decision Theory: Q-Learning CPSC 322 Decision Theory 5, Slide 1 Lecture Overview 1 Recap 2 Asynchronous Value Iteration 3 Q-Learning

More information

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning CSE 473: Artificial Intelligence Reinforcement Learning Dan Weld Today s Outline Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Many slides

More information

Efficient Maximization in Solving POMDPs

Efficient Maximization in Solving POMDPs Efficient Maximization in Solving POMDPs Zhengzhu Feng Computer Science Department University of Massachusetts Amherst, MA 01003 fengzz@cs.umass.edu Shlomo Zilberstein Computer Science Department University

More information

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)

More information

A Probabilistic Relational Model for Characterizing Situations in Dynamic Multi-Agent Systems

A Probabilistic Relational Model for Characterizing Situations in Dynamic Multi-Agent Systems A Probabilistic Relational Model for Characterizing Situations in Dynamic Multi-Agent Systems Daniel Meyer-Delius 1, Christian Plagemann 1, Georg von Wichert 2, Wendelin Feiten 2, Gisbert Lawitzky 2, and

More information

A Probabilistic Relational Model for Characterizing Situations in Dynamic Multi-Agent Systems

A Probabilistic Relational Model for Characterizing Situations in Dynamic Multi-Agent Systems A Probabilistic Relational Model for Characterizing Situations in Dynamic Multi-Agent Systems Daniel Meyer-Delius 1, Christian Plagemann 1, Georg von Wichert 2, Wendelin Feiten 2, Gisbert Lawitzky 2, and

More information

CS221 Practice Midterm

CS221 Practice Midterm CS221 Practice Midterm Autumn 2012 1 ther Midterms The following pages are excerpts from similar classes midterms. The content is similar to what we ve been covering this quarter, so that it should be

More information

Qualifying Exam in Machine Learning

Qualifying Exam in Machine Learning Qualifying Exam in Machine Learning October 20, 2009 Instructions: Answer two out of the three questions in Part 1. In addition, answer two out of three questions in two additional parts (choose two parts

More information

A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time

A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time April 16, 2016 Abstract In this exposition we study the E 3 algorithm proposed by Kearns and Singh for reinforcement

More information

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it

More information

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING AN INTRODUCTION Prof. Dr. Ann Nowé Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING WHAT IS IT? What is it? Learning from interaction Learning about, from, and while

More information

, and rewards and transition matrices as shown below:

, and rewards and transition matrices as shown below: CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount

More information

The Markov Decision Process (MDP) model

The Markov Decision Process (MDP) model Decision Making in Robots and Autonomous Agents The Markov Decision Process (MDP) model Subramanian Ramamoorthy School of Informatics 25 January, 2013 In the MAB Model We were in a single casino and the

More information

Decision Theory: Markov Decision Processes

Decision Theory: Markov Decision Processes Decision Theory: Markov Decision Processes CPSC 322 Lecture 33 March 31, 2006 Textbook 12.5 Decision Theory: Markov Decision Processes CPSC 322 Lecture 33, Slide 1 Lecture Overview Recap Rewards and Policies

More information

Dual Memory Model for Using Pre-Existing Knowledge in Reinforcement Learning Tasks

Dual Memory Model for Using Pre-Existing Knowledge in Reinforcement Learning Tasks Dual Memory Model for Using Pre-Existing Knowledge in Reinforcement Learning Tasks Kary Främling Helsinki University of Technology, PL 55, FI-25 TKK, Finland Kary.Framling@hut.fi Abstract. Reinforcement

More information

Chapter 3: The Reinforcement Learning Problem

Chapter 3: The Reinforcement Learning Problem Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information

An Adaptive Clustering Method for Model-free Reinforcement Learning

An Adaptive Clustering Method for Model-free Reinforcement Learning An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at

More information

Reinforcement Learning

Reinforcement Learning 1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision

More information

Active Policy Iteration: Efficient Exploration through Active Learning for Value Function Approximation in Reinforcement Learning

Active Policy Iteration: Efficient Exploration through Active Learning for Value Function Approximation in Reinforcement Learning Active Policy Iteration: fficient xploration through Active Learning for Value Function Approximation in Reinforcement Learning Takayuki Akiyama, Hirotaka Hachiya, and Masashi Sugiyama Department of Computer

More information

Preference Elicitation for Sequential Decision Problems

Preference Elicitation for Sequential Decision Problems Preference Elicitation for Sequential Decision Problems Kevin Regan University of Toronto Introduction 2 Motivation Focus: Computational approaches to sequential decision making under uncertainty These

More information

Discrete planning (an introduction)

Discrete planning (an introduction) Sistemi Intelligenti Corso di Laurea in Informatica, A.A. 2017-2018 Università degli Studi di Milano Discrete planning (an introduction) Nicola Basilico Dipartimento di Informatica Via Comelico 39/41-20135

More information

Prioritized Sweeping Converges to the Optimal Value Function

Prioritized Sweeping Converges to the Optimal Value Function Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning March May, 2013 Schedule Update Introduction 03/13/2015 (10:15-12:15) Sala conferenze MDPs 03/18/2015 (10:15-12:15) Sala conferenze Solving MDPs 03/20/2015 (10:15-12:15) Aula Alpha

More information

Probabilistic Planning. George Konidaris

Probabilistic Planning. George Konidaris Probabilistic Planning George Konidaris gdk@cs.brown.edu Fall 2017 The Planning Problem Finding a sequence of actions to achieve some goal. Plans It s great when a plan just works but the world doesn t

More information

CSL302/612 Artificial Intelligence End-Semester Exam 120 Minutes

CSL302/612 Artificial Intelligence End-Semester Exam 120 Minutes CSL302/612 Artificial Intelligence End-Semester Exam 120 Minutes Name: Roll Number: Please read the following instructions carefully Ø Calculators are allowed. However, laptops or mobile phones are not

More information

Reinforcement Learning II

Reinforcement Learning II Reinforcement Learning II Andrea Bonarini Artificial Intelligence and Robotics Lab Department of Electronics and Information Politecnico di Milano E-mail: bonarini@elet.polimi.it URL:http://www.dei.polimi.it/people/bonarini

More information

ARTIFICIAL INTELLIGENCE. Reinforcement learning

ARTIFICIAL INTELLIGENCE. Reinforcement learning INFOB2KI 2018-2019 Utrecht University The Netherlands ARTIFICIAL INTELLIGENCE Reinforcement learning Lecturer: Silja Renooij These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Formal models of interaction Daniel Hennes 27.11.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Taxonomy of domains Models of

More information

COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning. Hanna Kurniawati

COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning. Hanna Kurniawati COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning Hanna Kurniawati Today } What is machine learning? } Where is it used? } Types of machine learning

More information

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018 Machine learning for image classification Lecture 14: Reinforcement learning May 9, 2018 Page 3 Outline Motivation Introduction to reinforcement learning (RL) Value function based methods (Q-learning)

More information

CS 188 Introduction to Fall 2007 Artificial Intelligence Midterm

CS 188 Introduction to Fall 2007 Artificial Intelligence Midterm NAME: SID#: Login: Sec: 1 CS 188 Introduction to Fall 2007 Artificial Intelligence Midterm You have 80 minutes. The exam is closed book, closed notes except a one-page crib sheet, basic calculators only.

More information

Chapter 16 Planning Based on Markov Decision Processes

Chapter 16 Planning Based on Markov Decision Processes Lecture slides for Automated Planning: Theory and Practice Chapter 16 Planning Based on Markov Decision Processes Dana S. Nau University of Maryland 12:48 PM February 29, 2012 1 Motivation c a b Until

More information

Lecture notes for Analysis of Algorithms : Markov decision processes

Lecture notes for Analysis of Algorithms : Markov decision processes Lecture notes for Analysis of Algorithms : Markov decision processes Lecturer: Thomas Dueholm Hansen June 6, 013 Abstract We give an introduction to infinite-horizon Markov decision processes (MDPs) with

More information

Probabilistic Model Checking and Strategy Synthesis for Robot Navigation

Probabilistic Model Checking and Strategy Synthesis for Robot Navigation Probabilistic Model Checking and Strategy Synthesis for Robot Navigation Dave Parker University of Birmingham (joint work with Bruno Lacerda, Nick Hawes) AIMS CDT, Oxford, May 2015 Overview Probabilistic

More information

PROBABILISTIC PLANNING WITH RISK-SENSITIVE CRITERION PING HOU. A dissertation submitted to the Graduate School

PROBABILISTIC PLANNING WITH RISK-SENSITIVE CRITERION PING HOU. A dissertation submitted to the Graduate School PROBABILISTIC PLANNING WITH RISK-SENSITIVE CRITERION BY PING HOU A dissertation submitted to the Graduate School in partial fulfillment of the requirements for the degree Doctor of Philosophy Major Subject:

More information

Final Exam, Fall 2002

Final Exam, Fall 2002 15-781 Final Exam, Fall 22 1. Write your name and your andrew email address below. Name: Andrew ID: 2. There should be 17 pages in this exam (excluding this cover sheet). 3. If you need more room to work

More information

RL 3: Reinforcement Learning

RL 3: Reinforcement Learning RL 3: Reinforcement Learning Q-Learning Michael Herrmann University of Edinburgh, School of Informatics 20/01/2015 Last time: Multi-Armed Bandits (10 Points to remember) MAB applications do exist (e.g.

More information

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 12: Reinforcement Learning Teacher: Gianni A. Di Caro REINFORCEMENT LEARNING Transition Model? State Action Reward model? Agent Goal: Maximize expected sum of future rewards 2 MDP PLANNING

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Function approximation Mario Martin CS-UPC May 18, 2018 Mario Martin (CS-UPC) Reinforcement Learning May 18, 2018 / 65 Recap Algorithms: MonteCarlo methods for Policy Evaluation

More information

This question has three parts, each of which can be answered concisely, but be prepared to explain and justify your concise answer.

This question has three parts, each of which can be answered concisely, but be prepared to explain and justify your concise answer. This question has three parts, each of which can be answered concisely, but be prepared to explain and justify your concise answer. 1. Suppose you have a policy and its action-value function, q, then you

More information

arxiv: v1 [cs.ai] 1 Jul 2015

arxiv: v1 [cs.ai] 1 Jul 2015 arxiv:507.00353v [cs.ai] Jul 205 Harm van Seijen harm.vanseijen@ualberta.ca A. Rupam Mahmood ashique@ualberta.ca Patrick M. Pilarski patrick.pilarski@ualberta.ca Richard S. Sutton sutton@cs.ualberta.ca

More information

Equivalence Notions and Model Minimization in Markov Decision Processes

Equivalence Notions and Model Minimization in Markov Decision Processes Equivalence Notions and Model Minimization in Markov Decision Processes Robert Givan, Thomas Dean, and Matthew Greig Robert Givan and Matthew Greig School of Electrical and Computer Engineering Purdue

More information

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels? Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity

More information

A Gentle Introduction to Reinforcement Learning

A Gentle Introduction to Reinforcement Learning A Gentle Introduction to Reinforcement Learning Alexander Jung 2018 1 Introduction and Motivation Consider the cleaning robot Rumba which has to clean the office room B329. In order to keep things simple,

More information

State Space Abstractions for Reinforcement Learning

State Space Abstractions for Reinforcement Learning State Space Abstractions for Reinforcement Learning Rowan McAllister and Thang Bui MLG RCC 6 November 24 / 24 Outline Introduction Markov Decision Process Reinforcement Learning State Abstraction 2 Abstraction

More information

1 [15 points] Search Strategies

1 [15 points] Search Strategies Probabilistic Foundations of Artificial Intelligence Final Exam Date: 29 January 2013 Time limit: 120 minutes Number of pages: 12 You can use the back of the pages if you run out of space. strictly forbidden.

More information

Markov Decision Processes Chapter 17. Mausam

Markov Decision Processes Chapter 17. Mausam Markov Decision Processes Chapter 17 Mausam Planning Agent Static vs. Dynamic Fully vs. Partially Observable Environment What action next? Deterministic vs. Stochastic Perfect vs. Noisy Instantaneous vs.

More information

Using first-order logic, formalize the following knowledge:

Using first-order logic, formalize the following knowledge: Probabilistic Artificial Intelligence Final Exam Feb 2, 2016 Time limit: 120 minutes Number of pages: 19 Total points: 100 You can use the back of the pages if you run out of space. Collaboration on the

More information

Reinforcement Learning. George Konidaris

Reinforcement Learning. George Konidaris Reinforcement Learning George Konidaris gdk@cs.brown.edu Fall 2017 Machine Learning Subfield of AI concerned with learning from data. Broadly, using: Experience To Improve Performance On Some Task (Tom

More information

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Mostafa D. Awheda Department of Systems and Computer Engineering Carleton University Ottawa, Canada KS 5B6 Email: mawheda@sce.carleton.ca

More information

Learning to Coordinate Efficiently: A Model-based Approach

Learning to Coordinate Efficiently: A Model-based Approach Journal of Artificial Intelligence Research 19 (2003) 11-23 Submitted 10/02; published 7/03 Learning to Coordinate Efficiently: A Model-based Approach Ronen I. Brafman Computer Science Department Ben-Gurion

More information

CSE 573. Markov Decision Processes: Heuristic Search & Real-Time Dynamic Programming. Slides adapted from Andrey Kolobov and Mausam

CSE 573. Markov Decision Processes: Heuristic Search & Real-Time Dynamic Programming. Slides adapted from Andrey Kolobov and Mausam CSE 573 Markov Decision Processes: Heuristic Search & Real-Time Dynamic Programming Slides adapted from Andrey Kolobov and Mausam 1 Stochastic Shortest-Path MDPs: Motivation Assume the agent pays cost

More information

Towards Faster Planning with Continuous Resources in Stochastic Domains

Towards Faster Planning with Continuous Resources in Stochastic Domains Towards Faster Planning with Continuous Resources in Stochastic Domains Janusz Marecki and Milind Tambe Computer Science Department University of Southern California 941 W 37th Place, Los Angeles, CA 989

More information

Procedia Computer Science 00 (2011) 000 6

Procedia Computer Science 00 (2011) 000 6 Procedia Computer Science (211) 6 Procedia Computer Science Complex Adaptive Systems, Volume 1 Cihan H. Dagli, Editor in Chief Conference Organized by Missouri University of Science and Technology 211-

More information

Machine Learning. Reinforcement learning. Hamid Beigy. Sharif University of Technology. Fall 1396

Machine Learning. Reinforcement learning. Hamid Beigy. Sharif University of Technology. Fall 1396 Machine Learning Reinforcement learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Machine Learning Fall 1396 1 / 32 Table of contents 1 Introduction

More information