State Space Reduction for Hierarchical Policy Formation


Department of Computer Science and Engineering
University of Texas at Arlington
Arlington, TX

State Space Reduction for Hierarchical Policy Formation

Mehran Asadi

Technical Report CSE

This report was also submitted as an M.S. thesis.

STATE SPACE REDUCTION FOR HIERARCHICAL POLICY FORMATION

The members of the Committee approve the master's thesis of Mehran Asadi:

Dr. Manfred Huber, Supervising Professor
Dr. Diane J. Cook
Dr. Lawrence B. Holder

Copyright by Mehran Asadi 2003
All Rights Reserved

STATE SPACE REDUCTION FOR HIERARCHICAL POLICY FORMATION

by

MEHRAN ASADI

Presented to the Faculty of the Graduate School of The University of Texas at Arlington in Partial Fulfillment of the Requirements for the Degree of

MASTER OF SCIENCE IN COMPUTER SCIENCE AND ENGINEERING

THE UNIVERSITY OF TEXAS AT ARLINGTON

December 2003

ACKNOWLEDGEMENTS

I would like to thank my research advisor, Dr. Huber, whose ideas were the origin of this research and whose advice has always been invaluable. He was the first person who truly taught me the basic concepts of A.I. and showed me the path of doing research in this field. He spent late nights writing papers with me, and all his suggestions made me think about our research more deeply. I also thank my committee for their careful judgment and their suggestions to improve this research. I must also thank my wife for all her support, without whom obtaining this degree would have been absolutely impossible. I do not know how I can express my appreciation for being a part of her life. Next, I must thank my mother and my mother-in-law; their support helped me to continue my education and they encouraged me to stay focused at every single step of my studies. Finally, I would like to thank my brother Mehrdad, my sister-in-law Elham and their daughter Donya, whose presence always makes me happy.

August 17, 2003

ABSTRACT

STATE SPACE REDUCTION FOR HIERARCHICAL POLICY FORMATION

Publication No.

Mehran Asadi, M.S.

The University of Texas at Arlington, 2003

Supervising Professor: Manfred Huber

This thesis provides new techniques for abstracting the state space of a Markov Decision Process (MDP). These techniques extend one of the recent minimization models, which is known as ε-reduction, to construct a partition space that has a smaller number of states than the original MDP.

As a result, learning policies on the partition space should be faster than on the original state space. The technique presented here is to execute a policy instead of a single action, and to group all states which have a small difference in transition probabilities and reward function under a given policy. This turns the original MDP into an SMDP-like model by expanding the actions of the original MDP into multi-step actions. When the reward structure is not known, a reward independent method is introduced for state aggregation. The reward independent method for state reduction is applied when reward information is not available, and a theorem in this thesis proves the solvability of this type of partition. Simulations on different state spaces show that the policies in both the MDP and this representation are very close, and the total learning time in the partition space in our approach is much smaller than the total amount of time spent on learning in the original state space.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
LIST OF ILLUSTRATIONS

Chapter
1. INTRODUCTION
   1.1 Decision making under uncertainty
   1.2 Models of planning
   1.3 Temporal abstraction
   1.4 Contribution of this thesis
       1.4.1 Temporal abstraction
       1.4.2 Action dependent decomposition
       1.4.3 Reward independent decomposition
2. FORMALISM
   2.1 Markov Decision Processes
       2.1.1 The Value function
       2.1.2 Computing the optimal policy
       2.1.3 Value iteration
       2.1.4 Policy iteration
   2.2 State Abstraction
   2.3 Introduction to SMDPs
       2.3.1 Policies
       2.3.2 Example of SMDPs
3. PREVIOUS WORK
   3.1 SMDPs
       3.1.1 Policies
       3.1.2 Example of SMDPs
   3.2 State abstraction in MDPs
   3.3 Bounded Parameter MDPs
   3.4 Epsilon Reduction Methods
4. EXTENSION TO EPSILON REDUCTION METHOD
   4.1 Epsilon-delta extension to SMDPs
   4.2 Action Dependent Decomposition
   4.3 A Simple Example
   4.4 Reward Independent Partition
   4.5 Experimental Results
   4.6 Conclusion and Future work

References
Biographical information

LIST OF ILLUSTRATIONS

Figure
1.1 A sample robot navigation environment
2.1 A graph view of a MDP
2.2 Policy Iteration
3.1 Comparison between MDP and SMDP
3.2 The rooms example as a grid world environment
3.3 The policy of one of the rooms
3.4 The state transition diagram for a BPMDP
4.1 Grid world of the example
4.2 Partition for option left
4.3 Partition for option up
4.4 Partition for option right
4.5 Partition for option down
4.6 Intersection of blocks of partition
4.7 Final blocks of partition
4.8 The pattern for the experiments
4.9 The final blocks of partition for experiment 1
4.10 Dean's method run time
4.11 The reward independent method running time
4.12 Number of iterations for learning a policy
4.13 Comparison of number of blocks
4.14 Environment for comparing run time
4.15 Running time for Dean's method
4.16 Running time of reward independent method
4.17 Number of iterations for learning a policy

CHAPTER I

INTRODUCTION

Markov decision processes (MDPs) are a useful way to model stochastic environments, as there are well-established algorithms to solve these models. Even though these algorithms find an optimal solution for the model, they suffer from high time complexity when the number of decision points is large. To address increasingly complex problems, it is also necessary to find representations that are sufficient to address the task while remaining sufficiently compact to permit learning in an efficient manner. The emphasis here is put on the state space representation used in the decision-making process rather than on the one used for sensing and memory purposes. The idea is that a reduced representation for decision making, combined with the use of increasingly competent actions in the form of policies, can dramatically reduce the number of decision points and can lead to a much more efficient transfer of learning experiences across situations and tasks. A number of learning approaches have used specially designed state space representations to increase the

efficiency of learning [4,8]. Here, particular features are hand-designed based on the task domain and the capabilities of the learning agent. In autonomous systems, however, this is generally a difficult task since it is hard to anticipate which parts of the underlying physical state are important for the given decision-making problem. Moreover, in hierarchical learning approaches the required information might change over time as increasingly competent actions become available. The same can be observed in biological systems, where information about all muscle fibers is initially instrumental in generating strategies for coordinated movement. However, as such strategies become established and ready to be used, this low-level information no longer has to be consciously taken into account when learning policies for new tasks. To achieve similar capabilities in artificial agents, state and knowledge representations should depend on the action set that is currently available, and become increasingly abstract as more higher-level policies become available as actions and fewer of the low-level action primitives are required. A small number of techniques for generating more compact state representations based on the actions and the reward function have been developed [4,8]. The work presented here builds on the ε reduction technique developed by Dean et al. [2] to derive representations in the form of state space

partitions that ensure that the utility of a policy learned in the reduced state space is within a fixed bound of the optimal policy. The work presented here extends the ε reduction technique by including policies as actions and thus using it to find approximate SMDP reductions. Furthermore, it derives partitions for individual actions and composes them into representations for any given subset of the action space. This is further extended by permitting the definition of reward independent partitions that can be refined once the reward function is known. The remainder of this chapter provides an overview of previous work in hierarchical learning and stochastic processes and the contribution of the thesis.

1.1 Decision making under uncertainty

One of the basic concepts in stochastic processes is a control process in an environment in which there is uncertainty. Solving a control process [13] is considered from the perspective of an agent that acts in the environment. An agent can be a robot which navigates a house, a human executing a strategy, or a program which controls traffic lights. The goal of decision-making is to find a plan or a policy that maximizes the total benefit of acting in an environment over a period of time. Decision-making has

broad application in operations research, artificial intelligence, control theory, management and scheduling [19]. Uncertainty exists in almost all situations in real life. This issue plays an important part in scientific problems and engineering models.

Figure 1.1: A sample robot navigation environment

An example in artificial intelligence is a robot that moves through a grid world, as in the mouse-and-maze problem. The robot has the ability to perform actions such as moving forward and turning by an angle, and the maze is an environment with different states as in Figure 1.1. The purpose of this representation is to provide the

information necessary to construct a navigation strategy for a given goal location. Uncertainty is present at all times in this environment. For example, the robot's motors may not function as expected, moving the robot in a wrong direction or moving it too far. Furthermore, the sensors of the robot can be unreliable and provide incorrect readings from the environment. A number of systems have the Markov property and can be modeled as a MDP. While well-known algorithms exist to solve a MDP and find an optimal solution (policy), the state representation of these problems is often so large that these algorithms require a large amount of memory and time.

1.2 Models of planning

The relationship between the time spent on planning and the time spent on executing a plan is a way to distinguish different planning models from each other. Usually, finding an optimal plan is time consuming. For this reason some planning methods are constructed in off-line mode, which permits them to be performed on powerful computers. Once a plan is constructed, it can be loaded onto a smaller computer to be executed on-line. The smaller

computer can be a robot with less memory and a slower processor. Off-line planning often assumes that complete knowledge of the environment is available, and it considers all outcomes, even those that have a very small chance of occurring. Thus, if the number of states is large this process has a high time complexity. Unlike off-line planning, on-line planning does not assume complete knowledge of the environment, and the agent tries to construct and refine a plan while acting in the world. In the extreme case, the agent starts to act with no initial plan and no model of the environment. This is particularly useful when the state space is large, as the agent often needs only the information for its next act and does not require complete knowledge of the environment. Reinforcement learning [6] is an example of such methods.

1.3 Temporal and state abstraction

Semi-Markov decision processes [21] were originally constructed in order to address the representation of a hierarchical action space in a MDP by considering the execution of sequences of actions, i.e. policies. This approach derives optimal solutions in the presence of more complex actions.

State abstraction is a group of methods where a single state represents a large group of states. State abstraction often involves a tradeoff between optimality and compactness, and one of the questions that needs to be answered in abstracting the state space is the relationship between a solution on the abstract model and a solution on the original model [1,2]. Another problem in state space reduction is that in a real situation the agent often does not know the task before executing an action.

1.4 Contribution of this thesis

This thesis introduces a new approach for state space reduction. In particular, it uses the techniques described in the following subsections to extend ε reduction [3], one of the recent methods in state space reduction.

1.4.1 Temporal abstraction

Actions can be considered to be primitive or high-level. A primitive (low-level) action takes a constant amount of time, but a high-level action takes varying amounts of time. Temporal abstraction refers to the use of high-level actions such as opening a door, which consists of several primitive actions like unlock, move and release.

This thesis uses the ability to apply multi-step actions instead of primitive actions in the ε reduction method to reduce the size of the state space.

1.4.2 Action dependent decomposition

One aspect of hierarchical learning is to construct a tree-type structure on the state space, in which actions in higher-level sets can be considered as policies in lower-level sets. A second aspect of hierarchical learning approaches is that as new and more complex actions become available, low-level actions are no longer required to learn a task and thus can be ignored. The intuition here is that such a policy will involve fewer decision points and, as a result, can be learned substantially faster. To take full advantage of this limitation of the action space, it should also be reflected in the state representation. In particular, once low-level actions are ignored, much more abstract state representations should be sufficient to address the same tasks.

1.4.3 Reward independent decomposition

This thesis addresses the state decomposition problem in two stages. It first assumes that the agent has no

knowledge of the rewards associated with the states, and tries to find a policy without having the reward function in hand. In this method, optimality is not guaranteed, but solving the reduced MDP is much faster than solving the original one. In many real-world situations the reward information is not known beforehand. In this case the state reduction cannot be performed. This thesis provides a new method for state reduction when the reward is not available and proves that the reduced space is solvable for a certain type of task. Moreover, it provides the possibility of reward-specific refinement to optimize the resulting representation once the reward information is available.

CHAPTER II

FORMALISM

In the MDP framework, a number of algorithms exist that provably converge to an optimal policy in both off-line and on-line planning. In the off-line case, these include value iteration and policy iteration. This chapter introduces the Markov decision process model. In this model the environment is divided into different states. A set of actions relates these states and makes the model a stochastic process. To each state-action pair, a transition probability and a reward value are assigned. The goal is to find an optimal policy, i.e. a policy that maximizes the utility of interacting with the environment. When almost-optimal policies are acceptable, there are algorithms that converge to a policy with a bounded distance from the optimal policy. One way to find this type of policy is to find all the states that have almost the same properties and group them as a new state. This method of state reduction is often called state aggregation.

State aggregation methods reduce the size of the space and make the decision problem easier to solve. One technique for state aggregation is ε-reduction, which will be discussed in later chapters and forms the basis for the work presented here. The ε-reduction algorithm constructs a valid MDP model, which converges to a policy.

2.1 Markov Decision Processes

A Markov decision process (MDP) is a 4-tuple (S, A, P, R) where S is the set of states, A is the set of actions available in each state, P is a transition probability function that assigns a value 0 ≤ p ≤ 1 to each state-action pair, and R is the reward function. The transition function is a map P : S × A × S → [0,1], usually denoted by P(s'|s, a), which is the probability that executing action a in state s will lead to state s'. Similarly, the reward function is a map R : S × A → ℝ, and R(s, a) denotes the reward gained by executing action a in state s. Figure 2.1 illustrates a MDP with 10 states and 2 different actions.
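To make the notation concrete, the following is a tiny, purely illustrative encoding of such a 4-tuple in Python; the two-state example values and the dictionary layout are assumptions made for this sketch and are not taken from the thesis.

```python
# A minimal, hypothetical encoding of an MDP (S, A, P, R).
# P[(s, a)] maps successor states s' to probabilities P(s'|s, a);
# R[(s, a)] is the immediate reward for executing a in s.
states = ["s0", "s1"]
actions = ["a0", "a1"]
P = {
    ("s0", "a0"): {"s0": 0.2, "s1": 0.8},
    ("s0", "a1"): {"s0": 1.0},
    ("s1", "a0"): {"s1": 1.0},
    ("s1", "a1"): {"s0": 0.5, "s1": 0.5},
}
R = {("s0", "a0"): 0.0, ("s0", "a1"): 0.0,
     ("s1", "a0"): 1.0, ("s1", "a1"): 0.0}
```

The later sketches in this chapter reuse this dictionary layout.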

Figure 2.1: A graphical view of a MDP

A policy is a map π : S → A, which assigns an action to each state. A stochastic process is said to satisfy the Markov property if for all t_0 < t_1 < ... < t_n < t and for all n it is true that
$$P(s_{n+1} \mid s_0, a_0, s_1, a_1, \ldots, s_n, a_n) = P(s_{n+1} \mid s_n, a_n).$$
The usual way to measure the cost or utility in an environment is to compute a value function. A value function is a mapping V : S → ℝ that assigns a utility to each state. The aim is to find an optimal policy, i.e. a policy that maximizes the value of interacting with the

environment. Throughout this thesis the reward criterion is used as the optimality criterion. The reward criterion is the expected discounted sum of rewards received by the agent under a policy and can be defined as:
$$\sum_{t=0}^{N} \gamma^{t} R(s_t, \pi(s_t))$$
where s_t indicates the state at time t, and γ ∈ [0,1] is a discount factor that indicates how time affects the utility measure.

2.1.1 The Value Function

The value function provides a utility measure for each state. These values can be computed in order to find an optimal policy. Any policy π defines a value function
$$V^{\pi}(s) = R(s, \pi(s)) + \sum_{t=1}^{\infty} \gamma^{t} R(s_t, \pi(s_t)) \qquad (2.1)$$
The Bellman equation creates a connection between the value of each state and the values of other states:
$$V^{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, V^{\pi}(s') \qquad (2.2)$$
Equation 2.2 shows the relation between the value function and the immediate reward received in the succeeding state [1].

Since the optimal policy assigns the best action to each state, the value function for the optimal policy can be defined as:
$$V^{*}(s) = \max_{a} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \right] \qquad (2.3)$$

2.1.2 Computing the Optimal Policy

Two common methods for finding an optimal policy are value iteration and policy iteration. This section provides an algorithm for each of these methods.

2.1.3 Value Iteration

This method starts with an arbitrary value function, such as V_0(s) = R(s, a) for some a ∈ A, and uses it to find the next value function using the equation:
$$V_{i+1}(s) = \max_{a} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V_i(s') \right]. \qquad (2.4)$$
The optimal value function V* is found when V_{t+1}(s) = V_t(s) for all s ∈ S. The corresponding optimal policy is:
$$\pi^{*}(s) = \arg\max_{a} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \right]. \qquad (2.5)$$
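The update in equation (2.4) and the greedy extraction in equation (2.5) can be written compactly in code. The following is a minimal sketch, assuming the dictionary-based MDP layout of the earlier illustration (P[(s, a)] as a dict of successor probabilities, R[(s, a)] as a scalar); it is an illustration, not the thesis' implementation.

```python
# Hypothetical value-iteration sketch for equations (2.4) and (2.5).
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    # V_0(s) = R(s, a) for some fixed action a, as in the text.
    V = {s: R[(s, actions[0])] for s in states}
    while True:
        V_new = {}
        for s in states:
            V_new[s] = max(
                R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in actions)
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new

def greedy_policy(states, actions, P, R, V, gamma=0.9):
    """Extract the greedy policy of equation (2.5) from a value function V."""
    return {s: max(actions,
                   key=lambda a: R[(s, a)] + gamma * sum(
                       p * V[s2] for s2, p in P[(s, a)].items()))
            for s in states}
```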

2.1.4 Policy Iteration

While value iteration updates the value and the policy at each step, policy iteration finds a policy and tries to improve it until the policy cannot be improved. Figure 2.2 describes pseudo-code by which policy iteration can be implemented [14].

procedure PolicyIteration
    π_0 ← an arbitrary policy
    j ← 0
    continue ← true
    while (continue)
        compute V^{π_j}
        π_{j+1} ← argmax_a(V^{π_j})
        if (π_j = π_{j+1}) then
            return π_j
            continue ← false
        else
            j ← j + 1
    end

Figure 2.2: Policy iteration
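As a concrete illustration of Figure 2.2, here is a minimal policy-iteration sketch in Python. The dictionary-based layout for P and R, and the use of iterative sweeps for policy evaluation instead of a linear solve, are assumptions of this sketch rather than details from the thesis.

```python
# Hypothetical policy-iteration sketch following Figure 2.2.
def policy_iteration(states, actions, P, R, gamma=0.9, eval_tol=1e-6):
    pi = {s: actions[0] for s in states}  # arbitrary initial policy
    while True:
        # Policy evaluation: approximate V^pi by repeated Bellman backups.
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                v = R[(s, pi[s])] + gamma * sum(
                    p * V[s2] for s2, p in P[(s, pi[s])].items())
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < eval_tol:
                break
        # Policy improvement: act greedily with respect to V^pi.
        pi_new = {
            s: max(actions, key=lambda a: R[(s, a)] + gamma * sum(
                p * V[s2] for s2, p in P[(s, a)].items()))
            for s in states}
        if pi_new == pi:
            return pi, V
        pi = pi_new
```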

2.2 State Abstraction

State abstraction or state aggregation is a group of methods for classifying the states into groups and representing them with a hierarchy. Under certain circumstances, state abstraction methods can be used to construct more compact MDPs that permit faster learning of optimal or approximate solutions. To do this, these methods use the basic components of a MDP, such as the transition probabilities and the reward function, to represent a large class of states with a single state of the abstract space. The most important requirements for the generated abstraction to be a valid approximate MDP are:

1. The difference in the transition function and the reward function between the two models has to be small.

2. For each policy on the original state space there must exist a policy in the abstract model. And if a state s' is not reachable from state s in the abstract model, then there should not exist a policy that leads from s to s' in the original state space.

CHAPTER III

PREVIOUS WORK

3.1 SMDPs

One of the approaches to treating temporal abstraction is to use the theory of semi-Markov decision processes (SMDPs). The actions in SMDPs take a variable amount of time and are intended to model temporally extended actions, represented as sequences of primary actions. Figure 3.1 shows a SMDP that is derived from a MDP [18]. In this figure the top panel shows the state trajectory over discrete time in the MDP, and the lower panel shows the larger state changes in a SMDP. The filled circles indicate decision points where a new action has to be selected, while the empty circles in the SMDP represent states in which the previously selected multi-step action is still active. As can be seen, a smaller number of decisions have to be made in the SMDP, which should accelerate learning.

Figure 3.1: Comparison between MDP and SMDP

3.1.1 Policies

A policy (option) in SMDPs is a triple o_i = (I_i, π_i, β_i) [20], where I_i is an initiation set, π_i : S × A → [0,1] is a primary policy and β_i : S → [0,1] is a termination condition. When a policy is executed, actions are chosen according to π_i until the policy terminates stochastically according to β_i. The initiation set and termination condition of a policy limit the range of states over which the policy needs to be defined and determine its termination. Given a set of policies, their initiation sets thus define a subset Π_s of available policies for each s ∈ S. As single-step actions are policies as well, A_s ⊆ Π_s.

The elements of A_s are called primary actions and the elements of Π_s are called multi-step actions. Given any set of multi-step actions, we consider policies over those actions. In this case we need to generalize the definition of the value function. The value of a state s ∈ S under an SMDP policy π_o is defined as:
$$V^{\pi_o}(s) = E[\, r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \ldots \mid \varepsilon(\pi_o, s, t)\,] \qquad (3.1)$$
where ε(π_o, s, t) denotes the event of an action under π_o being initiated in state s at time t, and r_t denotes the reward at time t. For any multi-step action o_i, the reward in state s is computed as:
$$R(s, o_i) = E[\, r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \ldots + \gamma^{k-1} r_{t+k}\,] \qquad (3.2)$$
where t + k is the time at which action o_i terminates. The transition value for state s is defined by:
$$F(s' \mid s, o_i) = \sum_{k=1}^{\infty} P(s_{t+k} = s',\, k \mid s_t = s,\, o_i)\, \gamma^{k} \qquad (3.3)$$
Using Bellman's equation we can compute the general policy. For any Markov policy π_o, the state value function can be written as:
$$V^{\pi_o}(s) = E[\, r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \ldots + \gamma^{k-1} r_{t+k} + \gamma^{k} V^{\pi_o}(s_{t+k}) \mid \varepsilon(\pi_o, s, t)\,] \qquad (3.4)$$

where k is the duration of the first multi-step action selected by π_o, and the Bellman equation for (3.4) is
$$V^{\pi_o}(s) = R(s, o) + \sum_{s'} F(s' \mid s, o)\, V^{\pi_o}(s') \qquad (3.5)$$
where o = π_o(s). The optimal value function is [18]:
$$V^{*}(s) = \max_{o \in O_s} \left[ R(s, o) + \sum_{s'} F(s' \mid s, o)\, V^{*}(s') \right] \qquad (3.6)$$

3.1.2 Example of a SMDP

Consider the rooms problem, a grid world environment with five rooms and one hallway illustrated in Figure 3.2. The cells of the grid are the states of the environment. From any state the robot can perform one of the four actions up, down, left or right, which have a stochastic effect and fail 50% of the time, i.e. with probability 1/2 they are successful, and the agent moves in one of the other three directions with probability 1/6 each. The reward for each state is zero.

Figure 3.2: The rooms example as a grid world environment

There is a hallway in this environment, designed to let the agent reach the other rooms and the elevator. For each room there are policies π_i which move the robot along the shortest path to the hallway or the other rooms.

Figure 3.3: The policy of one of the four rooms

For example, the policy for one room is shown in Figure 3.3. The termination condition for these policies is zero for states within the room and 1 for states outside the room. The initiation set I consists of the states in the room. As shown in Figure 3.2, there is more than one policy that leads from each state to other rooms or to the hallway. The arrows in Figure 3.2 illustrate the results of the different multi-step actions. When a multi-step action is executed within a room, it will end in a state outside the room.

3.2 State Abstraction in MDPs

In the previous chapter the concepts of MDPs and their related algorithms have been introduced. While representing a planning problem in stochastic domains with MDPs is

suitable, the complexity of the algorithms increases rapidly with the size of the state space. It is often possible to represent a MDP by an approximate MDP with a smaller set of states and almost equivalent state transition and reward functions. This will generally not guarantee the existence of an optimal solution, but should accelerate the process of finding an approximate solution. The result of the factorization can be described as a partition of the state space where states in the same block have the same transition probability to other blocks. The basic idea of these methods has its origin in automata theory and stochastic processes. This section introduces a framework built around the concept of ε-homogenous partitions [3], in which states in the same block may have transitions to states in other blocks as long as the difference in their transition probabilities is smaller than ε. Any ε-homogenous partition results in a MDP with a state space comprised of the blocks of the partition, and transitions from each block to any other block. In order to define these transitions, a bounded parameter MDP (BPMDP) is defined. A BPMDP is a MDP in which the transition probabilities and rewards are given as intervals instead of single values. Bounded parameter MDPs are used in state aggregation for solving large MDPs, and they can be used to find an approximate solution. The remainder of this chapter

discusses the following topics: the formal definition of BPMDPs, ε-homogenous partitions, and the methods for finding an optimal policy.

3.3 Bounded Parameter MDPs

A bounded parameter MDP is a four-tuple M̂ = (Ŝ, Â, P̂, R̂) where Ŝ and Â are defined as for MDPs, and P̂, R̂ are analogous to P and R in MDPs but assign closed intervals rather than single values to each state-action pair. That is, for any action a and states s, s' ∈ S, the values of R̂(s, a) and P̂(s'|s, a) are both closed intervals [l, u] for real numbers l, u with l ≤ u, and in the case of P̂ we require 0 ≤ l ≤ u ≤ 1 [2]. To ensure that P̂ is well-defined we require that for any action a and state s, the sum of the lower bounds of P̂(s'|s, a) over all states s' must be less than or equal to 1, while the upper bounds must sum to a value greater than or equal to 1. Figure 3.4 illustrates the state-transition diagram for a simple BPMDP with three states and one action.

Figure 3.4: The state transition diagram for a BPMDP

An interval value function V̂ is a map from states to closed intervals. A BPMDP M̂ = (Ŝ, Â, P̂, R̂) induces an exact MDP M = (S, A, P, R) where S = Ŝ and A = Â, and for any action a and states s, s' ∈ S, P(s'|s, a) and R(s, a) lie in the ranges of P̂(s'|s, a) and R̂(s, a) respectively. In a BPMDP M̂, the interval value V̂^π(s) for state s is defined by the interval:
$$\left[\, \min_{P, R \in \hat{P}, \hat{R}} V^{\pi}(s),\ \max_{P, R \in \hat{P}, \hat{R}} V^{\pi}(s) \,\right] \qquad (3.7)$$

3.4 ε reduction Method

Dean et al. [4] introduced a family of algorithms that take a MDP and a real value 0 ≤ ε ≤ 1 as input and compute a bounded parameter MDP in which each closed interval has a width of less than ε. The states in this MDP correspond to blocks of a partition of the state space in which the states in the same block have the same properties in terms of transitions and rewards. Let P = {B_1, ..., B_n} be a partition of the state space [4].

Definition 3.1: A partition P = {B_1, ..., B_n} of the state space of a MDP M has the property of ε-approximate stochastic bisimulation homogeneity with respect to M for 0 ≤ ε ≤ 1 if and only if for each B_i, B_j ∈ P, for each a ∈ A and for each s, s' ∈ B_i:
$$|R(s, a) - R(s', a)| \le \varepsilon \qquad \forall a \in A \qquad (3.8)$$
and
$$\left| \sum_{s'' \in B_j} P(s'' \mid s, a) - \sum_{s'' \in B_j} P(s'' \mid s', a) \right| \le \varepsilon.$$

Definition 3.2: A partition P' is a refinement of a partition P if and only if each block of P' is a subset of

some block of P. In this case we say that P is coarser than P'.

Definition 3.3: The immediate reward partition is the partition in which two states s, s' ∈ S are in the same block if they have the same rewards.

Definition 3.4: The block B_i of a partition P is ε-stable [4] with respect to block B_j if and only if for all actions a ∈ A and all states s, s' ∈ B_i:
$$\left| \sum_{s'' \in B_j} P(s'' \mid s, a) - \sum_{s'' \in B_j} P(s'' \mid s', a) \right| \le \varepsilon.$$

The ε model reduction algorithm first uses the immediate reward partition as an initial partition and checks the ε-stability of each block of this partition until there are no unstable blocks left. For example, when block B_i happens to be unstable with respect to block B_j, the block B_i is replaced by a set of sub-blocks B_{i_1}, ..., B_{i_k} such that each B_{i_m} is a maximal sub-block of B_i that is ε-stable with respect to B_j.
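To make Definition 3.4 and the splitting step concrete, the following is a minimal sketch in Python. The dictionary layout for P and the grouping of states by binned block-transition probabilities are assumptions of this illustration; in particular, the binning is only one simple way to form ε-stable sub-blocks and is not the exact maximal-sub-block procedure of [4].

```python
# Hypothetical helpers for epsilon-stability checking and block splitting.
# P[(s, a)] maps successor states to probabilities P(s'|s, a).
def block_prob(s, a, block, P):
    """Probability of moving from s into `block` in one step under action a."""
    return sum(p for s2, p in P[(s, a)].items() if s2 in block)

def is_eps_stable(B_i, B_j, actions, P, eps):
    """Definition 3.4: all states of B_i agree (within eps) on transitions into B_j."""
    return all(abs(block_prob(s, a, B_j, P) - block_prob(t, a, B_j, P)) <= eps
               for a in actions for s in B_i for t in B_i)

def split_block(B_i, B_j, actions, P, eps):
    """Split B_i into sub-blocks whose states fall into the same eps-wide bins
    of block-transition probability into B_j, for every action."""
    groups = {}
    for s in B_i:
        key = tuple(round(block_prob(s, a, B_j, P) / max(eps, 1e-12))
                    for a in actions)
        groups.setdefault(key, set()).add(s)
    return list(groups.values())
```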

Theorem 3.1: For ε > 0, the partition P found by the ε-reduction model algorithm from the MDP M is coarser than, and thus no larger than, the state space of M [4].

Once the ε-stable blocks of the partition have been constructed, the transition and reward functions between blocks can be defined. The transition interval of each block is, by definition, bounded by the minimum and maximum probabilities of all possible transitions from the states of one block to the states of another block:
$$\hat{P}(B_j \mid B_i, a) = \left[\, \min_{s \in B_i} \sum_{s' \in B_j} P(s' \mid s, a),\ \max_{s \in B_i} \sum_{s' \in B_j} P(s' \mid s, a) \,\right] \qquad (3.9)$$
Similarly, the reward for a block B_j is:
$$\hat{R}(B_j, a) = \left[\, \min_{s \in B_j} R(s, a),\ \max_{s \in B_j} R(s, a) \,\right] \qquad (3.10)$$
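A small sketch of how the interval bounds in equations (3.9) and (3.10) could be computed for a block-level BPMDP, again assuming the dictionary layout used in the earlier sketches (the layout and function names are illustrative, not from the thesis):

```python
# Hypothetical computation of the interval transition and reward bounds.
def interval_transition(B_i, B_j, a, P):
    """Equation (3.9): [min, max] over states in B_i of the one-step
    probability of landing anywhere in B_j under action a."""
    probs = [sum(p for s2, p in P[(s, a)].items() if s2 in B_j) for s in B_i]
    return (min(probs), max(probs))

def interval_reward(B_j, a, R):
    """Equation (3.10): [min, max] immediate reward over states in B_j."""
    rewards = [R[(s, a)] for s in B_j]
    return (min(rewards), max(rewards))
```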

CHAPTER IV

EXTENSION TO ε REDUCTION METHOD

While Dean's ε reduction technique permits the derivation of appropriate state abstractions in the form of state space partitions, it poses several problems when applied in practice. First, it relies heavily on complete knowledge of the reward structure of the problem. The goal of the original technique is to obtain policies with similar utility. In many practical problems, however, it is more important to achieve the task goal than it is to do so in the optimal way. In other words, correctly representing the connectivity and ensuring the achievability of the task objective is often more important than precision in the value function. To reflect this, the reduction technique can easily be extended to include separate thresholds ε and δ for the reward function and the transition probabilities, respectively. This makes it more flexible and permits emphasizing task achievement over the utility of the learned policy. The second important step is the capability of including the state abstraction technique in a hierarchical

learning scheme. This implies that it should be able to efficiently deal with increasing action spaces that over time include more temporally extended actions in the form of learned policies. To address this, the abstraction method should change the representation as such hierarchical changes are made. This chapter presents extensions to the reduction technique that permit the use of policies as actions within the SMDP framework and that allow for the efficient construction of final partitions for varying action sets. To achieve this while still guaranteeing similar bounds on the quality of a policy learned on the reduced state space, the basic technique has to be extended to account for actions that perform multiple transitions on the underlying state space. The final part of this chapter discusses state space reduction when the reward function is not available. In these situations, refinement can be done using the transition probabilities. This method also shows that when it is necessary to run different tasks in the same environment, refinement by transition probabilities has to be performed only for the first task and can subsequently be augmented by task-specific reward refinement. In this way the presented methods can further reduce the time complexity in situations where multiple tasks have to be learned in the same environment.

4.1 ε,δ Reduction for SMDPs

For a given MDP we construct the policies o_i = (I_i, π_i, β_i) by defining sub-goals and finding the policies π_i that lead to the sub-goals from each state s ∈ S. The transition probability function F(s'|s, o_i) and the reward function R(s, o_i) for this state and policy can be computed with equations 3.2 and 3.3. Discount and probabilities are folded here into a single value. As can be seen, calculation of this model is significantly more complex than in the case of single-step actions. However, the transition probability is a pure function of the option and can thus be completely precomputed at the time at which the policy itself is learned. As a result, only the discounted reward estimate has to be re-computed for each new learning task. The transition and reward criteria for constructing a partition over policies are, for all states s, s' in the same block:
$$|R(s, o_i) - R(s', o_i)| \le \varepsilon \qquad \forall o_i \in O$$
and
$$\left| \sum_{s'' \in B_j} F(s'' \mid s, o_i) - \sum_{s'' \in B_j} F(s'' \mid s', o_i) \right| \le \delta \qquad (4.1)$$
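The following sketch illustrates one way the option model of equations (3.2) and (3.3) could be precomputed by iterating their fixed-point form (folding the discount into the transition model, as described above), together with a check of the ε,δ criterion in equation (4.1). The dictionary-based layout, the fixed number of sweeps, and all function names are assumptions of this illustration, not the thesis' implementation.

```python
# Hypothetical precomputation of an option model (R_o, F_o) and an
# epsilon,delta comparison of two states under that model.
def option_model(states, P, R, pi_o, beta, gamma=0.9, sweeps=200):
    """pi_o[s] is the option's primitive action in s; beta[s] is its
    termination probability in s. Iterates the fixed-point equations
    R(s,o) = R(s,a) + gamma * sum_s' P(s'|s,a)(1-beta(s'))R(s',o) and
    F(x|s,o) = gamma * sum_s' P(s'|s,a)[beta(s')1{s'=x} + (1-beta(s'))F(x|s',o)]."""
    R_o = {s: 0.0 for s in states}
    F_o = {(s, x): 0.0 for s in states for x in states}
    for _ in range(sweeps):
        for s in states:
            a = pi_o[s]
            succ = P[(s, a)]
            R_o[s] = R[(s, a)] + gamma * sum(
                p * (1.0 - beta[s1]) * R_o[s1] for s1, p in succ.items())
            for x in states:
                F_o[(s, x)] = gamma * sum(
                    p * (beta[s1] * (1.0 if s1 == x else 0.0)
                         + (1.0 - beta[s1]) * F_o[(s1, x)])
                    for s1, p in succ.items())
    return R_o, F_o

def within_eps_delta(s1, s2, option_models, blocks, eps, delta):
    """Equation (4.1): s1 and s2 may share a block only if their option
    rewards differ by at most eps and their discounted block-transition
    values F differ by at most delta, for every option and every block."""
    for R_o, F_o in option_models:
        if abs(R_o[s1] - R_o[s2]) > eps:
            return False
        for B_j in blocks:
            if abs(sum(F_o[(s1, x)] for x in B_j)
                   - sum(F_o[(s2, x)] for x in B_j)) > delta:
                return False
    return True
```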

4.2 Action Dependent Decomposition of ε,δ Reduction

The intuition here is that complex policies will involve fewer decision points and, as a result, can be learned substantially faster. In hierarchical learning systems it is thus useful to remove primitive actions from consideration as more complex actions become available and capable of addressing new problems. On the other hand, such a limitation of the action space should also be reflected in the state representation. In particular, once low-level actions are ignored, much more abstract state representations should be sufficient to address the same tasks. However, it is generally not known beforehand at which point lower-level actions can be safely ignored without compromising the set of tasks that can be addressed. The state reduction technique therefore has to be flexible and able to adjust to changes in the action space efficiently without incurring the overhead of completely re-computing a partition. To permit such flexibility, the approach presented here derives partitions on a per-action basis and provides an efficient approach to construct overall partitions from these. In particular, it derives an ε,δ partition P_i for each action o_i.

Let M be a SMDP with n different actions o_1, ..., o_n and let P_1, ..., P_n be the ε,δ partitions corresponding to each action, where P_i = {B^i_1, ..., B^i_{m_i}} for i ∈ W = {i | o_i ∈ O}. Define Φ = P_1 × P_2 × ... × P_n, the cross product of all partitions. Each element of Φ has the form φ_j = (B^1_{σ_1(j)}, ..., B^n_{σ_n(j)}) where σ_i is a function with domain Φ and range {1, ..., m_i}. Each element φ_j ∈ Φ corresponds to a block B̃_j = ∩_i B^i_{σ_i(j)}. Since B^i_k ∩ B^i_l = ∅ for all 1 ≤ k, l ≤ m_i with k ≠ l, {B̃_j} is a partition over all actions. Given a particular subset of actions, a partition for the learning task can now be derived as the set of all nonempty blocks resulting from the intersection of the blocks for the participating actions. A block in the resulting partition can therefore be represented by a vector over the actions involved, where each entry indicates the index of the block within the corresponding single-action partition. Once the initial blocks are constructed by the above algorithm, these blocks are refined until they are ε-stable according to Dean's method. Changes in the action set therefore do not require a recalculation of the individual partitions but only changes in the length of the vectors representing the new states and a recalculation of the final refinement step. This means that changes in the action set can be performed efficiently, and a simple mechanism can be provided to use the previously learned value function even beyond the change of actions and to use it as a starting point for subsequent additional learning. This is particularly important if actions are added over time to permit refinement of the initially learned policy through finer-grained decisions.
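A minimal sketch of this composition step: each state is labeled with the index of its block in every participating per-action partition, and states that share a label vector form one block of the combined partition. The function and the small example below are purely illustrative.

```python
# Hypothetical composition of per-action partitions by intersection.
def intersect_partitions(states, partitions):
    """partitions: one partition per participating action,
    each given as a list of sets of states."""
    combined = {}
    for s in states:
        # Vector of block indices, one entry per per-action partition.
        label = tuple(next(i for i, block in enumerate(part) if s in block)
                      for part in partitions)
        combined.setdefault(label, set()).add(s)
    return list(combined.values())

# Example: two partitions of {0, 1, 2, 3} compose into three blocks.
blocks = intersect_partitions(
    [0, 1, 2, 3],
    [[{0, 1}, {2, 3}], [{0}, {1, 2, 3}]])
# blocks == [{0}, {1}, {2, 3}]
```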

4.3 A Simple Example

In this example we assume a grid world with a mobile robot which can perform four primitive deterministic actions: left, right, up and down. The reward for actions that lead the agent to another cell is assumed to be -1. In order to construct an option we define a policy for each action. The termination condition is hitting the wall, and the policy repeats its action until it terminates. Figure 4.1 shows this scenario.

Figure 4.1: Grid world for the example

Figures 4.2 through 4.5 show the possible partitions for the four options. Let B^i_j be block j of partition i derived from action o_i. Then the cross product Φ of these blocks contains all possible combinations of these blocks:

Φ = {B^1_1, B^1_2} × {B^2_1, B^2_2} × {B^3_1, B^3_2} × {B^4_1, B^4_2}

Figure 4.2: Partition for option left

Figure 4.3: Partition for option up

Figure 4.4: Partition for option right

Figure 4.5: Partition for option down

Figure 4.6: Intersection of partitions

The intersection partition has the elements:

B^{1,2,3,4}_1 = B^1_1 ∩ B^4_1 ∩ B^2_1 ∩ B^3_1
B^{1,2,3,4}_2 = B^1_2 ∩ B^4_1 ∩ B^2_1 ∩ B^3_1
⋮

Figure 4.6 illustrates the intersection of the partitions. These blocks form the initial blocks for the ε reduction technique. The result of the refinement is illustrated in Figure 4.7. Since performing an action in a state now leads to another block rather than to a single state, each block of Figure 4.7 can be considered a single state in the resulting BPMDP.

Figure 4.7: Final blocks of partition

4.4 Reward Independent Partitions

Real environments usually do not provide all the necessary information to an agent, and the agent needs to find out these details by itself. For example, it is common that an agent does not have full information about the reward structure. In these situations, constructing the immediate reward partition is not possible and the partitions for ε,δ reduction have to be determined differently. For this purpose an algorithm is introduced which derives partitions in two different phases. This reward independent partition method constructs the initial blocks by distinguishing the terminal states of the available actions from non-terminal states and refines them using the transition probabilities. If the reward structure is available, this method further refines the initial partitions using the reward and transition criteria. The advantage of this construction is that the learning process can be done on-line while performing the reward abstraction. Also, whenever there is a change in the reward criteria, only the final refinement part has to be recomputed.

Definition 4.1: A subset C of the state space S is called a terminal set under action a if P(s'|s, a) = 0 for all s ∈ C and s' ∉ C.
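A minimal check of Definition 4.1 in code, assuming the dictionary layout used in the earlier sketches (the layout and the function name are illustrative assumptions):

```python
# Hypothetical test of Definition 4.1: C is terminal under action a if no
# probability mass leaves C from any state in C.
def is_terminal_set(C, a, P):
    return all(s_next in C or p == 0.0
               for s in C
               for s_next, p in P[(s, a)].items())
```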

Definition 4.2: Let P^(n)(s'|s, a) denote the probability of first visiting state s' from state s after n steps. That is,
$$P^{(n)}(s' \mid s, a) = P(s_{k+n} = s',\, s_{k+n-1} \ne s',\, s_{k+n-2} \ne s', \ldots, s_{k+1} \ne s' \mid s_k = s,\, a)$$

Definition 4.3: For fixed states s and s', let F*(s'|s, a) = Σ_{n=1}^∞ P^(n)(s'|s, a). The symbol F*(s'|s, a) is the probability of ever visiting state s' from state s.

Proposition 4.1: A state s belongs to a terminal set with respect to action a if F*(s|s, a) = 1.

Proposition 4.1 gives a direct way to find the terminal sets, i.e. the termination conditions for each action. Once the terminal sets are constructed, the state space can be partitioned by transition probabilities using equation 4.1. In situations where the reward information is not available, the reward independent method can be used to learn a policy without the need to determine the complete reward structure first.

Proposition 4.2: For any policy π for which the goal G can be represented as a conjunction of terminal sets (sub-goals) of the available actions in the original MDP M, there is a policy π_Φ in the reduced MDP M_Φ that achieves G, provided that for each state in M from which there exists a path to G, there exists such a path for which P(s_{t+1}|s_t, π(s_t)) > δ at every step.

Proof: The blocks of the partition Φ = {B_1, ..., B_n} have the following property: for all s_1, s_2 ∈ B_i,
$$\left| \sum_{s'' \in B_j} F(s'' \mid s_1, o_i) - \sum_{s'' \in B_j} F(s'' \mid s_2, o_i) \right| \le \delta \qquad (4.2)$$
Let Π be the set of all policies in the stochastic domain S. For every policy π that fulfills the requirements of the proposition, there exists a policy π_Φ in the partition space such that for each n ∈ ℕ, if there is a path of length n from state s_0 to a goal state G under policy π, then there is a path from the block B_{s_0} containing s_0 to the block B_G containing G under policy π_Φ.

Case k = 1: If F(G|s_0, π(s_0)) > δ then by condition (4.2), for all s ∈ B_{s_0},
$$\left| \sum_{s' \in B_G} F(s' \mid s, \pi(s_0)) - \sum_{s' \in B_G} F(s' \mid s_0, \pi(s_0)) \right| \le \delta,$$
and thus F(G|s, π(s_0)) ≥ F(G|s_0, π(s_0)) − δ > 0. Define the policy π_Φ such that π_Φ(B_{s_0}) = π(s_0); then F(B_G|B_{s_0}, π_Φ(B_{s_0})) > 0.

Case k = n − 1: Assume that for each path of length less than or equal to n − 1 that reaches state G from s_0 under policy π, there is a corresponding path under policy π_Φ in the partition space.

Case k = n: Each path that reaches G from s_0 under policy π in n steps contains a path of n − 1 steps that reaches G from s_1 under policy π. By the induction hypothesis, there is a policy π_Φ that leads to B_G from B_{s_1}. Now if s_0 is an element of B_{s_1} ∪ ... ∪ B_{s_{n-1}}, the blocks already chosen by paths of length less than or equal to n − 1, then there is a policy that leads to B_G from B_{s_0} under π_Φ, and the value π_Φ(B_{s_0}) is already defined. But if s_0 ∉ B_{s_1} ∪ ... ∪ B_{s_{n-1}}, then by the induction hypothesis it only has to be shown that there is a policy π_Φ that fulfills the induction hypothesis and which leads from B_{s_0} to B_{s_1} such that F(B_{s_1}|B_{s_0}, π_Φ(B_{s_0})) > 0. By condition (4.2), for all s ∈ B_{s_0},
$$\left| \sum_{s' \in B_{s_1}} F(s' \mid s, \pi(s_0)) - \sum_{s' \in B_{s_1}} F(s' \mid s_0, \pi(s_0)) \right| \le \delta,$$
and therefore
$$F(B_{s_1} \mid B_{s_0}, \pi_{\Phi}(B_{s_0})) = \sum_{s' \in B_{s_1}} F(s' \mid s, \pi(s_0)) \ge \sum_{s' \in B_{s_1}} F(s' \mid s_0, \pi(s_0)) - \delta > 0.$$

Therefore the proposition holds, and thus for every policy with a goal as defined in the proposition there is a policy that achieves the goal in the reward independent partition.

4.5 Experimental Results

Experiment 1: In order to compare the model minimization methods that are introduced in this thesis, several state spaces of different sizes have been examined with this algorithm. These examples follow the pattern illustrated in Figure 4.8. The underlying actions are left, right, up and down; they are successful with probability 0.9 and return to the same state with probability 0.1. In this experiment the primitive actions are extended by repeating them until they hit a wall. Figure 4.9 illustrates the final blocks of the partition in this experiment, and Figures 4.10 and 4.11 illustrate the run times in different phases of the algorithms.

Figure 4.8: The pattern for the experiments

Figure 4.9: The final blocks of the partition for experiment 1

Figure 4.10: Run time for Dean's method with multi-step actions

Figure 4.11: Run time for the reward independent method

Figure 4.12: Number of iterations for learning a policy (value of policy vs. number of iterations, for the original state space and the partition space)

Figure 4.12 illustrates the learning curve for a single state using value iteration in the original state space and in the partition space. The reward and transition probabilities are discounted with γ = 0.9 for the multi-step actions. These graphs suggest that the value of the learned policy in partition space is smaller. However, the time spent on learning in the partition space is smaller than the time spent on learning in the original space.

In Figure 4.13 the numbers of partitions produced by Dean's method and by the two-stage method are compared. In order to construct state spaces with larger numbers of states, we replicate the pattern in Figure 4.8 several times and rerun the experiment.

Figure 4.13: Comparison of number of blocks (number of blocks of the partition vs. number of states, for the original space, the partition with Dean's algorithm, and the partition with transition probabilities)

Experiment 2: This experiment has been performed in order to compare the total partitioning time after the first trial. In order to

do so, the goal state has been placed in different positions and the run times of Dean's algorithm and the two-stage algorithm are investigated. Figure 4.14 shows an environment for this experiment:

Figure 4.14: Environment for comparing run times of the two algorithms over different trials

This environment consists of three grid worlds, each of which consists of different rooms. The termination states of the available actions are illustrated in black in Figure 4.14.

The actions for each state are multi-step actions. These actions terminate when they reach a sub-goal in the same room. This experiment shows that, even though the run time of Dean's algorithm is lower than the run time of the reward independent algorithm for the first trial, after the first trial the reward independent algorithm is much faster, as the transition partitioning for this algorithm is already done in the first trial and is not necessary after it has been performed once. Figures 4.15 and 4.16 show the difference in run times for 6 different trials. For the first trial the situation is similar to the previous experiment, and the total run time of Dean's method is smaller than the total run time of the two-stage method. When a different goal is placed in the environment, the reward independent method does not need to refine the state space with the transition probabilities again, as this has been done for the first task. On the other hand, Dean's method needs to redefine the initial blocks for each task. As a result, the total run time after the first task in the reward independent method is significantly smaller than the run time of Dean's method. This experiment also shows that policies can be learned in the absence of prior knowledge of the reward.

Figure 4.15: Run time of Dean's algorithm for 6 trials

Figure 4.16: Run time of the reward independent algorithm for 6 trials

Learning experiments are performed with a reward of 100 at the goal and an intermediate reward of 30 for approaching the goal from a particular direction. The learning curves for one state, using value iteration with the transition probability partition and the intermediate reward, are shown in Figure 4.17.

Figure 4.17: Number of iterations for learning a policy

4.6 Conclusion

Markov decision processes are a useful way to model a stochastic environment, as there are well-established algorithms to solve this type of model. However, these algorithms have a high time complexity when the state space is large. One way to address this problem is to use ε reduction to group all states that have a small difference in

transition probability and reward function and to consider them as a single state. This method falls into the category called state aggregation. ε reduction relies on the reward structure; however, there are situations in which the reward information is not known. Furthermore, ε reduction does not have the capability of using temporally extended actions by executing a multi-step action instead of a single-step action. Temporally extended actions reduce the number of decision points where a new action has to be selected. As a smaller number of decisions has to be made, the learning process should accelerate. To address this and to further expand the capabilities of ε reduction, this thesis improves this reduction technique with the SMDP framework and creates a two-phase method that can be used in the absence of reward information. In environments where the reward structure is not available, the initial blocks of the partition are the blocks that contain the terminating states of each available multi-step action. These blocks can be refined according to the transition probabilities in order to define a valid BPMDP. However, when the reward information is available, these blocks can be refined according to the transition probabilities and the reward function in order to find a better approximate solution.

The improvement to ε reduction through temporally extended actions and refinement without reward information is particularly advantageous in situations where multiple tasks have to be learned in the same environment. Once the transition refinement has been performed on the state space, this refinement is not necessary for other tasks, and future tasks can be learned in the same reduced state space.

REFERENCES

[1] Darwiche, A. and Goldszmidt, M. Action networks: A framework for reasoning about actions and change under uncertainty. UAI-94, Seattle.

[2] Dean, T., Kaelbling, L. P., Kirman, J., and Nicholson, A. Planning with deadlines in stochastic domains. AAAI-93, Washington, D.C.

[3] Dean, T. and Kanazawa, K. A model for reasoning about persistence and causation. Computational Intelligence, 5(3).

[4] Dean, T. and Wellman, M. Planning and Control. Morgan Kaufmann, San Mateo.

[5] Dearden, R. and Boutilier, C. Integrating planning and execution in stochastic domains. UAI-94, Seattle. Howard, R. A. Dynamic Probabilistic Systems. Wiley.

[6] Dietterich, T. G. (2000). Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition. Journal of Artificial Intelligence Research, 13.

[7] Kushmerick, N., Hanks, S., and Weld, D. An algorithm for probabilistic least-commitment planning. AAAI-94, Seattle.

[8] Mahadevan, S. (1996). Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results. Machine Learning, 22.

[9] Mahadevan, S., Marchalleck, N., Das, T., & Gosavi, A. (1997). Self-Improving Factory Simulation using Continuous-Time Average Reward Reinforcement Learning. Proceedings of the Fourteenth International Conference on Machine Learning.


Reinforcement Learning and Control CS9 Lecture notes Andrew Ng Part XIII Reinforcement Learning and Control We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make

More information

Planning Under Uncertainty: Structural Assumptions and Computational Leverage

Planning Under Uncertainty: Structural Assumptions and Computational Leverage Planning Under Uncertainty: Structural Assumptions and Computational Leverage Craig Boutilier Dept. of Comp. Science Univ. of British Columbia Vancouver, BC V6T 1Z4 Tel. (604) 822-4632 Fax. (604) 822-5485

More information

16.4 Multiattribute Utility Functions

16.4 Multiattribute Utility Functions 285 Normalized utilities The scale of utilities reaches from the best possible prize u to the worst possible catastrophe u Normalized utilities use a scale with u = 0 and u = 1 Utilities of intermediate

More information

Temporal Difference Learning & Policy Iteration

Temporal Difference Learning & Policy Iteration Temporal Difference Learning & Policy Iteration Advanced Topics in Reinforcement Learning Seminar WS 15/16 ±0 ±0 +1 by Tobias Joppen 03.11.2015 Fachbereich Informatik Knowledge Engineering Group Prof.

More information

Introduction to Reinforcement Learning. CMPT 882 Mar. 18

Introduction to Reinforcement Learning. CMPT 882 Mar. 18 Introduction to Reinforcement Learning CMPT 882 Mar. 18 Outline for the week Basic ideas in RL Value functions and value iteration Policy evaluation and policy improvement Model-free RL Monte-Carlo and

More information

CS 7180: Behavioral Modeling and Decisionmaking

CS 7180: Behavioral Modeling and Decisionmaking CS 7180: Behavioral Modeling and Decisionmaking in AI Markov Decision Processes for Complex Decisionmaking Prof. Amy Sliva October 17, 2012 Decisions are nondeterministic In many situations, behavior and

More information

Solving Stochastic Planning Problems With Large State and Action Spaces

Solving Stochastic Planning Problems With Large State and Action Spaces Solving Stochastic Planning Problems With Large State and Action Spaces Thomas Dean, Robert Givan, and Kee-Eung Kim Thomas Dean and Kee-Eung Kim Robert Givan Department of Computer Science Department of

More information

Predictive Timing Models

Predictive Timing Models Predictive Timing Models Pierre-Luc Bacon McGill University pbacon@cs.mcgill.ca Borja Balle McGill University bballe@cs.mcgill.ca Doina Precup McGill University dprecup@cs.mcgill.ca Abstract We consider

More information

Hidden Markov Models (HMM) and Support Vector Machine (SVM)

Hidden Markov Models (HMM) and Support Vector Machine (SVM) Hidden Markov Models (HMM) and Support Vector Machine (SVM) Professor Joongheon Kim School of Computer Science and Engineering, Chung-Ang University, Seoul, Republic of Korea 1 Hidden Markov Models (HMM)

More information

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam:

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam: Marks } Assignment 1: Should be out this weekend } All are marked, I m trying to tally them and perhaps add bonus points } Mid-term: Before the last lecture } Mid-term deferred exam: } This Saturday, 9am-10.30am,

More information

Chapter 3: The Reinforcement Learning Problem

Chapter 3: The Reinforcement Learning Problem Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information

CSE250A Fall 12: Discussion Week 9

CSE250A Fall 12: Discussion Week 9 CSE250A Fall 12: Discussion Week 9 Aditya Menon (akmenon@ucsd.edu) December 4, 2012 1 Schedule for today Recap of Markov Decision Processes. Examples: slot machines and maze traversal. Planning and learning.

More information

Bias-Variance Error Bounds for Temporal Difference Updates

Bias-Variance Error Bounds for Temporal Difference Updates Bias-Variance Bounds for Temporal Difference Updates Michael Kearns AT&T Labs mkearns@research.att.com Satinder Singh AT&T Labs baveja@research.att.com Abstract We give the first rigorous upper bounds

More information

Learning in Zero-Sum Team Markov Games using Factored Value Functions

Learning in Zero-Sum Team Markov Games using Factored Value Functions Learning in Zero-Sum Team Markov Games using Factored Value Functions Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 27708 mgl@cs.duke.edu Ronald Parr Department of Computer

More information

Introduction to Spring 2006 Artificial Intelligence Practice Final

Introduction to Spring 2006 Artificial Intelligence Practice Final NAME: SID#: Login: Sec: 1 CS 188 Introduction to Spring 2006 Artificial Intelligence Practice Final You have 180 minutes. The exam is open-book, open-notes, no electronics other than basic calculators.

More information

Decision Theory: Q-Learning

Decision Theory: Q-Learning Decision Theory: Q-Learning CPSC 322 Decision Theory 5 Textbook 12.5 Decision Theory: Q-Learning CPSC 322 Decision Theory 5, Slide 1 Lecture Overview 1 Recap 2 Asynchronous Value Iteration 3 Q-Learning

More information

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning CSE 473: Artificial Intelligence Reinforcement Learning Dan Weld Today s Outline Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Many slides

More information

Efficient Maximization in Solving POMDPs

Efficient Maximization in Solving POMDPs Efficient Maximization in Solving POMDPs Zhengzhu Feng Computer Science Department University of Massachusetts Amherst, MA 01003 fengzz@cs.umass.edu Shlomo Zilberstein Computer Science Department University

More information

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)

More information

A Probabilistic Relational Model for Characterizing Situations in Dynamic Multi-Agent Systems

A Probabilistic Relational Model for Characterizing Situations in Dynamic Multi-Agent Systems A Probabilistic Relational Model for Characterizing Situations in Dynamic Multi-Agent Systems Daniel Meyer-Delius 1, Christian Plagemann 1, Georg von Wichert 2, Wendelin Feiten 2, Gisbert Lawitzky 2, and

More information

A Probabilistic Relational Model for Characterizing Situations in Dynamic Multi-Agent Systems

A Probabilistic Relational Model for Characterizing Situations in Dynamic Multi-Agent Systems A Probabilistic Relational Model for Characterizing Situations in Dynamic Multi-Agent Systems Daniel Meyer-Delius 1, Christian Plagemann 1, Georg von Wichert 2, Wendelin Feiten 2, Gisbert Lawitzky 2, and

More information

CS221 Practice Midterm

CS221 Practice Midterm CS221 Practice Midterm Autumn 2012 1 ther Midterms The following pages are excerpts from similar classes midterms. The content is similar to what we ve been covering this quarter, so that it should be

More information

Qualifying Exam in Machine Learning

Qualifying Exam in Machine Learning Qualifying Exam in Machine Learning October 20, 2009 Instructions: Answer two out of the three questions in Part 1. In addition, answer two out of three questions in two additional parts (choose two parts

More information

A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time

A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time April 16, 2016 Abstract In this exposition we study the E 3 algorithm proposed by Kearns and Singh for reinforcement

More information

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it

More information

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING AN INTRODUCTION Prof. Dr. Ann Nowé Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING WHAT IS IT? What is it? Learning from interaction Learning about, from, and while

More information

, and rewards and transition matrices as shown below:

, and rewards and transition matrices as shown below: CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount

More information

The Markov Decision Process (MDP) model

The Markov Decision Process (MDP) model Decision Making in Robots and Autonomous Agents The Markov Decision Process (MDP) model Subramanian Ramamoorthy School of Informatics 25 January, 2013 In the MAB Model We were in a single casino and the

More information

Decision Theory: Markov Decision Processes

Decision Theory: Markov Decision Processes Decision Theory: Markov Decision Processes CPSC 322 Lecture 33 March 31, 2006 Textbook 12.5 Decision Theory: Markov Decision Processes CPSC 322 Lecture 33, Slide 1 Lecture Overview Recap Rewards and Policies

More information

Dual Memory Model for Using Pre-Existing Knowledge in Reinforcement Learning Tasks

Dual Memory Model for Using Pre-Existing Knowledge in Reinforcement Learning Tasks Dual Memory Model for Using Pre-Existing Knowledge in Reinforcement Learning Tasks Kary Främling Helsinki University of Technology, PL 55, FI-25 TKK, Finland Kary.Framling@hut.fi Abstract. Reinforcement

More information

Chapter 3: The Reinforcement Learning Problem

Chapter 3: The Reinforcement Learning Problem Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information

An Adaptive Clustering Method for Model-free Reinforcement Learning

An Adaptive Clustering Method for Model-free Reinforcement Learning An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at

More information

Reinforcement Learning

Reinforcement Learning 1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision

More information

Active Policy Iteration: Efficient Exploration through Active Learning for Value Function Approximation in Reinforcement Learning

Active Policy Iteration: Efficient Exploration through Active Learning for Value Function Approximation in Reinforcement Learning Active Policy Iteration: fficient xploration through Active Learning for Value Function Approximation in Reinforcement Learning Takayuki Akiyama, Hirotaka Hachiya, and Masashi Sugiyama Department of Computer

More information

Preference Elicitation for Sequential Decision Problems

Preference Elicitation for Sequential Decision Problems Preference Elicitation for Sequential Decision Problems Kevin Regan University of Toronto Introduction 2 Motivation Focus: Computational approaches to sequential decision making under uncertainty These

More information

Discrete planning (an introduction)

Discrete planning (an introduction) Sistemi Intelligenti Corso di Laurea in Informatica, A.A. 2017-2018 Università degli Studi di Milano Discrete planning (an introduction) Nicola Basilico Dipartimento di Informatica Via Comelico 39/41-20135

More information

Prioritized Sweeping Converges to the Optimal Value Function

Prioritized Sweeping Converges to the Optimal Value Function Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning March May, 2013 Schedule Update Introduction 03/13/2015 (10:15-12:15) Sala conferenze MDPs 03/18/2015 (10:15-12:15) Sala conferenze Solving MDPs 03/20/2015 (10:15-12:15) Aula Alpha

More information

Probabilistic Planning. George Konidaris

Probabilistic Planning. George Konidaris Probabilistic Planning George Konidaris gdk@cs.brown.edu Fall 2017 The Planning Problem Finding a sequence of actions to achieve some goal. Plans It s great when a plan just works but the world doesn t

More information

CSL302/612 Artificial Intelligence End-Semester Exam 120 Minutes

CSL302/612 Artificial Intelligence End-Semester Exam 120 Minutes CSL302/612 Artificial Intelligence End-Semester Exam 120 Minutes Name: Roll Number: Please read the following instructions carefully Ø Calculators are allowed. However, laptops or mobile phones are not

More information

Reinforcement Learning II

Reinforcement Learning II Reinforcement Learning II Andrea Bonarini Artificial Intelligence and Robotics Lab Department of Electronics and Information Politecnico di Milano E-mail: bonarini@elet.polimi.it URL:http://www.dei.polimi.it/people/bonarini

More information

ARTIFICIAL INTELLIGENCE. Reinforcement learning

ARTIFICIAL INTELLIGENCE. Reinforcement learning INFOB2KI 2018-2019 Utrecht University The Netherlands ARTIFICIAL INTELLIGENCE Reinforcement learning Lecturer: Silja Renooij These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Formal models of interaction Daniel Hennes 27.11.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Taxonomy of domains Models of

More information

COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning. Hanna Kurniawati

COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning. Hanna Kurniawati COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning Hanna Kurniawati Today } What is machine learning? } Where is it used? } Types of machine learning

More information

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018 Machine learning for image classification Lecture 14: Reinforcement learning May 9, 2018 Page 3 Outline Motivation Introduction to reinforcement learning (RL) Value function based methods (Q-learning)

More information

CS 188 Introduction to Fall 2007 Artificial Intelligence Midterm

CS 188 Introduction to Fall 2007 Artificial Intelligence Midterm NAME: SID#: Login: Sec: 1 CS 188 Introduction to Fall 2007 Artificial Intelligence Midterm You have 80 minutes. The exam is closed book, closed notes except a one-page crib sheet, basic calculators only.

More information

Chapter 16 Planning Based on Markov Decision Processes

Chapter 16 Planning Based on Markov Decision Processes Lecture slides for Automated Planning: Theory and Practice Chapter 16 Planning Based on Markov Decision Processes Dana S. Nau University of Maryland 12:48 PM February 29, 2012 1 Motivation c a b Until

More information

Lecture notes for Analysis of Algorithms : Markov decision processes

Lecture notes for Analysis of Algorithms : Markov decision processes Lecture notes for Analysis of Algorithms : Markov decision processes Lecturer: Thomas Dueholm Hansen June 6, 013 Abstract We give an introduction to infinite-horizon Markov decision processes (MDPs) with

More information

Probabilistic Model Checking and Strategy Synthesis for Robot Navigation

Probabilistic Model Checking and Strategy Synthesis for Robot Navigation Probabilistic Model Checking and Strategy Synthesis for Robot Navigation Dave Parker University of Birmingham (joint work with Bruno Lacerda, Nick Hawes) AIMS CDT, Oxford, May 2015 Overview Probabilistic

More information

PROBABILISTIC PLANNING WITH RISK-SENSITIVE CRITERION PING HOU. A dissertation submitted to the Graduate School

PROBABILISTIC PLANNING WITH RISK-SENSITIVE CRITERION PING HOU. A dissertation submitted to the Graduate School PROBABILISTIC PLANNING WITH RISK-SENSITIVE CRITERION BY PING HOU A dissertation submitted to the Graduate School in partial fulfillment of the requirements for the degree Doctor of Philosophy Major Subject:

More information

Final Exam, Fall 2002

Final Exam, Fall 2002 15-781 Final Exam, Fall 22 1. Write your name and your andrew email address below. Name: Andrew ID: 2. There should be 17 pages in this exam (excluding this cover sheet). 3. If you need more room to work

More information

RL 3: Reinforcement Learning

RL 3: Reinforcement Learning RL 3: Reinforcement Learning Q-Learning Michael Herrmann University of Edinburgh, School of Informatics 20/01/2015 Last time: Multi-Armed Bandits (10 Points to remember) MAB applications do exist (e.g.

More information

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 12: Reinforcement Learning Teacher: Gianni A. Di Caro REINFORCEMENT LEARNING Transition Model? State Action Reward model? Agent Goal: Maximize expected sum of future rewards 2 MDP PLANNING

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Function approximation Mario Martin CS-UPC May 18, 2018 Mario Martin (CS-UPC) Reinforcement Learning May 18, 2018 / 65 Recap Algorithms: MonteCarlo methods for Policy Evaluation

More information

This question has three parts, each of which can be answered concisely, but be prepared to explain and justify your concise answer.

This question has three parts, each of which can be answered concisely, but be prepared to explain and justify your concise answer. This question has three parts, each of which can be answered concisely, but be prepared to explain and justify your concise answer. 1. Suppose you have a policy and its action-value function, q, then you

More information

arxiv: v1 [cs.ai] 1 Jul 2015

arxiv: v1 [cs.ai] 1 Jul 2015 arxiv:507.00353v [cs.ai] Jul 205 Harm van Seijen harm.vanseijen@ualberta.ca A. Rupam Mahmood ashique@ualberta.ca Patrick M. Pilarski patrick.pilarski@ualberta.ca Richard S. Sutton sutton@cs.ualberta.ca

More information

Equivalence Notions and Model Minimization in Markov Decision Processes

Equivalence Notions and Model Minimization in Markov Decision Processes Equivalence Notions and Model Minimization in Markov Decision Processes Robert Givan, Thomas Dean, and Matthew Greig Robert Givan and Matthew Greig School of Electrical and Computer Engineering Purdue

More information

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels? Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity

More information

A Gentle Introduction to Reinforcement Learning

A Gentle Introduction to Reinforcement Learning A Gentle Introduction to Reinforcement Learning Alexander Jung 2018 1 Introduction and Motivation Consider the cleaning robot Rumba which has to clean the office room B329. In order to keep things simple,

More information

State Space Abstractions for Reinforcement Learning

State Space Abstractions for Reinforcement Learning State Space Abstractions for Reinforcement Learning Rowan McAllister and Thang Bui MLG RCC 6 November 24 / 24 Outline Introduction Markov Decision Process Reinforcement Learning State Abstraction 2 Abstraction

More information

1 [15 points] Search Strategies

1 [15 points] Search Strategies Probabilistic Foundations of Artificial Intelligence Final Exam Date: 29 January 2013 Time limit: 120 minutes Number of pages: 12 You can use the back of the pages if you run out of space. strictly forbidden.

More information

Markov Decision Processes Chapter 17. Mausam

Markov Decision Processes Chapter 17. Mausam Markov Decision Processes Chapter 17 Mausam Planning Agent Static vs. Dynamic Fully vs. Partially Observable Environment What action next? Deterministic vs. Stochastic Perfect vs. Noisy Instantaneous vs.

More information

Using first-order logic, formalize the following knowledge:

Using first-order logic, formalize the following knowledge: Probabilistic Artificial Intelligence Final Exam Feb 2, 2016 Time limit: 120 minutes Number of pages: 19 Total points: 100 You can use the back of the pages if you run out of space. Collaboration on the

More information

Reinforcement Learning. George Konidaris

Reinforcement Learning. George Konidaris Reinforcement Learning George Konidaris gdk@cs.brown.edu Fall 2017 Machine Learning Subfield of AI concerned with learning from data. Broadly, using: Experience To Improve Performance On Some Task (Tom

More information

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Mostafa D. Awheda Department of Systems and Computer Engineering Carleton University Ottawa, Canada KS 5B6 Email: mawheda@sce.carleton.ca

More information

Learning to Coordinate Efficiently: A Model-based Approach

Learning to Coordinate Efficiently: A Model-based Approach Journal of Artificial Intelligence Research 19 (2003) 11-23 Submitted 10/02; published 7/03 Learning to Coordinate Efficiently: A Model-based Approach Ronen I. Brafman Computer Science Department Ben-Gurion

More information

CSE 573. Markov Decision Processes: Heuristic Search & Real-Time Dynamic Programming. Slides adapted from Andrey Kolobov and Mausam

CSE 573. Markov Decision Processes: Heuristic Search & Real-Time Dynamic Programming. Slides adapted from Andrey Kolobov and Mausam CSE 573 Markov Decision Processes: Heuristic Search & Real-Time Dynamic Programming Slides adapted from Andrey Kolobov and Mausam 1 Stochastic Shortest-Path MDPs: Motivation Assume the agent pays cost

More information

Towards Faster Planning with Continuous Resources in Stochastic Domains

Towards Faster Planning with Continuous Resources in Stochastic Domains Towards Faster Planning with Continuous Resources in Stochastic Domains Janusz Marecki and Milind Tambe Computer Science Department University of Southern California 941 W 37th Place, Los Angeles, CA 989

More information

Procedia Computer Science 00 (2011) 000 6

Procedia Computer Science 00 (2011) 000 6 Procedia Computer Science (211) 6 Procedia Computer Science Complex Adaptive Systems, Volume 1 Cihan H. Dagli, Editor in Chief Conference Organized by Missouri University of Science and Technology 211-

More information

Machine Learning. Reinforcement learning. Hamid Beigy. Sharif University of Technology. Fall 1396

Machine Learning. Reinforcement learning. Hamid Beigy. Sharif University of Technology. Fall 1396 Machine Learning Reinforcement learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Machine Learning Fall 1396 1 / 32 Table of contents 1 Introduction

More information