Rationality of Reward Sharing in Multi-agent Reinforcement Learning

Kazuteru Miyazaki and Shigenobu Kobayashi

Tokyo Institute of Technology, Department of Computational Intelligence and Systems Science, Interdisciplinary Graduate School of Science and Engineering, 4259 Nagatsuta, Midori-ku, Yokohama 226-8502 JAPAN

Abstract. In multi-agent reinforcement learning systems, it is important to share a reward among all agents. We focus on the Rationality Theorem of Profit Sharing [5] and analyze how to share a reward among all profit sharing agents. When an agent gets a direct reward R (R > 0), an indirect reward δR (δ ≥ 0) is given to the other agents. We have derived the following necessary and sufficient condition to preserve rationality:

  δ < (M-1) / (M^W0 (1 - (1/M)^WT) (n-1) L),

where M and L are the maximum numbers of conflicting rules and of conflicting rational rules in the same sensory input, W0 and WT are the maximum episode lengths of a direct-reward and an indirect-reward agent, and n is the number of agents. This theorem is derived by avoiding the least desirable situation, in which the expected reward per action is zero. Therefore, if we use this theorem, we can enjoy several beneficial aspects of reward sharing. Through numerical examples, we confirm the effectiveness of this theorem.

1 Introduction

Achieving cooperation in multi-agent systems is a highly desirable goal. In recent years, the bottom-up approach to multi-agent systems, which contrasts sharply with the top-down approach of DAI (Distributed Artificial Intelligence), has prevailed. In particular, approaches that realize cooperation through reinforcement learning are notable. There is much literature [2, 9, 10, 8, 7, 11, 12] on multi-agent reinforcement learning; Q-learning [10] and the Classifier System [4] are used in [2, 9, 7] and [11, 8], respectively. Previous works [1, 2] compare Profit Sharing [3] with Q-learning in the pursuit problem [1] and the cranes control problem [2]. These papers claim that Profit Sharing is suitable for multi-agent reinforcement learning systems.
In multi-agent environments where there is no negative reward, it is important to share a reward among all agents. Conventional work has used ad hoc sharing schemes. Though reward sharing may improve learning speed and quality, it can also damage system behavior. In particular, it is important to preserve the rationality condition, that is, that the expected reward per action is larger than zero.

In this paper, we aim to preserve the rationality condition in multi-agent environments where there is no negative reward. We focus on the Rationality Theorem of Profit Sharing [5] and analyze how to share a reward among all profit sharing agents. We show the necessary and sufficient condition to preserve the rationality condition in multi-agent profit sharing systems. Using this theorem, we can enjoy several beneficial aspects of reward sharing without falling into the least desirable situation, in which the expected reward per action is zero.

Section 2 describes the problem, the method and notations. Section 3 presents the necessary and sufficient condition to preserve the rationality condition in multi-agent profit sharing systems. Section 4 gives numerical examples that illustrate the theorem. Section 5 concludes.

2 The Domain

2.1 Problem Formulation

Consider n (n > 1) agents in an unknown environment. At each discrete time step, agent i (i = 1, 2, ..., n) is selected from the n agents based on the selection probabilities P_i (P_i > 0, Σ_i P_i = 1); the selected agent senses the environment and performs an action. An agent senses a set of discrete attribute-value pairs and chooses among M discrete actions. We denote agent i's sensory inputs as x_i, y_i, ... and its actions as a_i, b_i, .... A pair of a sensory input and an action is called a rule; we denote the rule `if x then a' as x→a. The function that maps sensory inputs to actions is called a policy. We call a policy rational if and only if its expected reward per action is larger than zero. The policy that maximizes the expected reward per action is called an optimal policy.

When the n0-th agent (0 < n0 ≤ n) gets a special sensory input on condition that (n0 - 1) agents already have special sensory inputs at that time step, the n0-th agent gets a direct reward R (R > 0) and the other (n - 1) agents get an indirect reward δR (δ ≥ 0).
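The turn-taking scheme above is easy to sketch in code; the following is a minimal illustration (the function name and the example probabilities are ours, not the paper's):

```python
import random

def select_agent(probs, rng):
    """Pick an agent index i with probability P_i (P_i > 0, sum P_i = 1)."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1  # guard against floating-point round-off

# Hypothetical example: three agents, agent 0 selected most of the time.
rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(10000):
    counts[select_agent([0.9, 0.05, 0.05], rng)] += 1
# counts[0] is close to 9000, i.e. agent 0 acts roughly 90% of the steps.
```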
We call the n0-th agent the direct-reward agent and the other (n - 1) agents indirect-reward agents. We do not have any information about n0 or the special sensory inputs. Furthermore, nobody (including the reward designer) knows whether the (n0 - 1) agents other than the n0-th agent are important or not. A set of n0 agents that is necessary for getting a direct reward is called a goal-agent set. In order to preserve the rationality condition, that is, an expected reward per action larger than zero, all agents in a goal-agent set must learn a rational policy.

We show an example of direct and indirect rewards in a pursuit problem (Fig. 1). There are six hunter agents (H0, H1, ..., H5) and one prey agent (Prey). When H0 moves down (Fig. 1a) and four hunter agents surround the prey agent as shown in figure 1b, the direct reward is given to H0 and the indirect reward is given to the other agents (H1, H2, ..., H5). The number of agents that the indirect reward is given to is 5 (= n - 1) because we have no information about n0 (= 4) and
the special sensory input. Traditionally, in this case, the direct reward would be given to H0, H1, H2 and H3. That presumes we know which agents are important for catching the prey. However, we cannot always have such information (for example, H5 may be important). Therefore, our formulation is more general than those used previously.

Fig. 1. The pursuit problem used to explain direct and indirect rewards.

In general, we could consider multi-agent environments in which several agents sense the environment and perform their actions asynchronously. In that case, several further problems arise, for example, how to set the reward function and how to decide the priority of actions. Therefore, we adopt the more basic environment in which one agent senses the environment and performs its action at each discrete time step. Furthermore, we use neither negative rewards nor reward values. Though such rewards are given to learning agents in most reinforcement learning systems, it is difficult to decide appropriate reward values and negative rewards. Therefore, we regard a reward as a good signal only.

2.2 Profit Sharing in Multi-agent Environments

The purpose of this paper is to guarantee the rationality of Profit Sharing (PS) [3] in the multi-agent environments discussed above. When a reward is given to an agent, PS reinforces the rules on an episode, that is, the sequence of rules selected between rewards, all at once. In multi-agent environments, an episode is interpreted by each agent. For example, when the agents select the rule sequence (x1→a1, x2→a2, y2→a2, z3→a3, x2→a2, y1→a1, y2→b2, z2→b2, x3→b3) (Fig. 2) and get the special sensory inputs (for getting a reward), it contains the episodes (x1→a1, y1→a1), (x2→a2, y2→a2, x2→a2, y2→b2, z2→b2) and (z3→a3, x3→b3) for agents 1, 2 and 3, respectively (Fig. 3). In this case, agent 3 gets a direct reward and the other agents get an indirect reward.
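The per-agent interpretation of an episode can be sketched as follows; a minimal, hypothetical implementation (all function names are ours) that splits the joint rule sequence of Fig. 2 by agent and then applies a geometric profit-sharing update of the kind described in this section:

```python
from collections import defaultdict

def split_episodes(joint_sequence):
    """Split a joint rule sequence into one episode per agent.

    joint_sequence: list of (agent_id, sensory_input, action) triples,
    ordered by time; returns {agent_id: [(input, action), ...]}.
    """
    episodes = defaultdict(list)
    for agent, x, a in joint_sequence:
        episodes[agent].append((x, a))
    return dict(episodes)

def reinforce(weights, episode, reward, M):
    """Profit-sharing update with the geometric function f_n = f_{n-1}/M:
    the rule selected n steps before the reward receives reward / M**n."""
    for n, rule in enumerate(reversed(episode), start=1):
        weights[rule] = weights.get(rule, 0.0) + reward / M**n

# The rule sequence of Fig. 2: agent 3 ends the episode and gets the
# direct reward R; agents 1 and 2 get the indirect reward delta * R.
seq = [(1, 'x1', 'a1'), (2, 'x2', 'a2'), (2, 'y2', 'a2'), (3, 'z3', 'a3'),
       (2, 'x2', 'a2'), (1, 'y1', 'a1'), (2, 'y2', 'b2'), (2, 'z2', 'b2'),
       (3, 'x3', 'b3')]
eps = split_episodes(seq)
weights = {}
M, R, delta = 2, 1.0, 0.05            # example values (ours)
reinforce(weights, eps[3], R, M)          # direct-reward agent
for i in (1, 2):
    reinforce(weights, eps[i], delta * R, M)  # indirect-reward agents
# e.g. weights[('y1','a1')] == delta*R/M and weights[('x1','a1')] == delta*R/M**2
```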
We call a subsequence of an episode a detour when the sensory input of the first rule and the sensory input resulting from the last rule are the same, though the two rules are different. For example, the episode (x2→a2, y2→a2, x2→a2,
y2→b2, z2→b2) of agent 2 contains the detour (y2→a2, x2→a2) (Fig. 3). A rule that always exists on a detour does not contribute to getting a direct reward. We call a rule ineffective if and only if it always exists on a detour; otherwise, the rule is called effective. For example, in the detour (y2→a2, x2→a2), y2→a2 is an ineffective rule and x2→a2 is an effective rule, because x2→a2 is not on a detour when it appears at the first sensory input of the episode (x2→a2, y2→a2, x2→a2, y2→b2, z2→b2).

Fig. 2. A rule sequence when the agents select actions in some order.

Fig. 3. Three episodes and one detour in figure 2.

We call a function that shares a reward among the rules on an episode a reinforcement function. The term f_i denotes the reinforcement value for the rule selected i steps before the reward is acquired. The weight S_r of each rule r_i on an episode (r_Wa, ..., r_i, ..., r_1) is reinforced by S_r_i = S_r_i + f_i, where W_a is the length of the episode, called the reinforcement interval. When a reward f_0 is given to the agent, we use the following reinforcement function, which satisfies the Rationality
Theorem of PS [5]:

  f_n = (1/M) f_{n-1},  n = 1, 2, ..., W_a,  (1)

where (f_0, W_a) is (R, W0) for the direct-reward agent and (δR, WT) (WT ≥ W0) for the indirect-reward agents. For example, in figure 3, the weights of y1→a1 and x1→a1 are reinforced by S_{y1a1} = S_{y1a1} + (1/M) δR and S_{x1a1} = S_{x1a1} + (1/M)^2 δR, respectively.

2.3 Properties of the Target Environments

The paper [2] claims that two main problems should be considered in multi-agent reinforcement learning systems. One is the perceptual aliasing problem [12], which is due to the agents' sensory limitations. The other is the uncertainty of state transitions, which is due to concurrent learning among the agents [8, 11]. We can treat both problems through two confusions [6].

We call the indistinction of state values a type 1 confusion. Figure 4a is an example of the type 1 confusion. In this example, the state value v is the minimum number of steps to a reward. The values of states a and b are 2 and 8, respectively. Though a and b are different states, the agent senses them as the same sensory input (state 1). If the agent visits states a and b equally often, the value of state 1 becomes 5 (= (2 + 8)/2). Therefore the value of state 1 is higher than the value of state 4 (which is 4). If the agent uses state values, it will prefer to move left in state 1; however, the agent should move right in state 1. This means that the agent learns an irrational policy in which it only transits between state b and state 1.

Fig. 4. a) An example of the type 1 confusion. b) An example of the type 2 confusion.

We call the indistinction of rational and irrational rules a type 2 confusion. Figure 4b is an example of the type 2 confusion. Though the action to move up in state a is irrational, it is rational in state b. Since the agent senses states a and b as the same sensory input (state 1), the action to move up in state 1 is
regarded as rational. If the agent learns the action to move right in state S, it takes an irrational policy that only transits between state a and state 2.

Fig. 5. Three classes of multi-agent environments.

In general, if there is a type 2 confusion at some sensory input, there is also a type 1 confusion at it. By these confusions, we can classify multi-agent environments into three classes, as shown in figure 5. Markov Decision Processes (MDPs), which are treated by many reinforcement learning systems, belong to class 1. Q-learning (QL) [10], which guarantees the acquisition of an optimal policy in MDPs, is deceived by the type 1 confusion, since it uses state values to make a policy. PS is not deceived by this confusion, since it does not use state values. On the other hand, reinforcement learning systems that use weights (including QL and PS) are deceived by the type 2 confusion. The Rationality Theorem of PS [5] guarantees the acquisition of a rational policy in the class with no type 2 confusion. Since we use that theorem, we must assume this class. Remark that this class still contains part of the two main problems (perceptual aliasing and uncertainty of state transitions) of multi-agent reinforcement learning. For example, an environment whose positive state transition probabilities never fall to zero has no type 2 confusion, even if the perceptual aliasing and the uncertainty of state transition problems are present. Therefore, our target environments are meaningful as multi-agent environments. In the next section, we extend the Rationality Theorem of PS to these multi-agent environments.

3 Rationality Theorem in Multi-agent Reinforcement Learning

3.1 The Basic Idea

In this section, we derive the necessary and sufficient condition to preserve the rationality condition in the multi-agent profit sharing systems discussed in the previous section.
We call the effective rules that will be learned under the Rationality Theorem of PS rational rules, and the others irrational rules. We show the relationship
Fig. 6. The relationship between rational, irrational, effective and ineffective rules.

of these rules in figure 6. Irrational rules should not be reinforced when they conflict with rational rules. When δ > 0, some irrational rule might be judged to be an effective rule. Therefore, it is important to suppress all irrational rules among the effective rules.

In order to preserve the rationality condition, all irrational rules in a goal-agent set must be suppressed. Conversely, if a goal-agent set is constructed from agents in which all irrational rules have been suppressed, the rationality condition is preserved. Therefore, we derive the necessary and sufficient condition on the range of δ that suppresses all irrational rules in some goal-agent set. First, we characterize the conflict structure in which it is most difficult to suppress irrational rules; for two conflict structures A and B, we say A is more difficult than B when the range of δ that can suppress every irrational rule of A is included in that of B. Second, we derive a necessary and sufficient condition on the range of δ that suppresses every irrational rule of the most difficult conflict structure. Last, the condition is extended to arbitrary conflict structures.

3.2 Proposal of the Rationality Theorem in Multi-agent Reinforcement Learning

Lemma 1 (The most difficult conflict structure). The most difficult conflict structure has only one irrational rule, and that rule has a self-loop.

The proof is shown in Appendix A. Figure 7 shows the most difficult conflict structure, in which the single self-looping irrational rule conflicts with L rational rules.

Lemma 2 (Suppressing the single self-looping irrational rule). The only irrational rule, with a self-loop, in some goal-agent set can be suppressed if and only if

  δ < (M-1) / (M^W0 (1 - (1/M)^WT) (n-1) L),  (2)

where M is the maximum number of conflicting rules in the same sensory input, L is the maximum number of conflicting rational rules, W0 is the maximum episode length of a direct-reward agent, WT is the reinforcement interval
Fig. 7. The most difficult conflict structure: one self-looping irrational rule conflicting with L rational rules.

of the indirect-reward agents, and n is the number of agents. The proof is shown in Appendix B.

By the law of transitivity, the following theorem is directly derived from these lemmas.

Theorem 1 (Rationality theorem in multi-agent reinforcement learning). Every irrational rule in some goal-agent set can be suppressed if and only if

  δ < (M-1) / (M^W0 (1 - (1/M)^WT) (n-1) L),  (3)

where M is the maximum number of conflicting rules in the same sensory input, L is the maximum number of conflicting rational rules, W0 is the maximum episode length of a direct-reward agent, WT is the reinforcement interval of the indirect-reward agents, and n is the number of agents.

3.3 The Meaning of Theorem 1

Theorem 1 is derived by avoiding the least desirable situation, in which the expected reward per action is zero. Therefore, if we use this theorem, we can enjoy multiple beneficial aspects of indirect rewards, including improvement of learning speed and quality.

We cannot know the number L in general; in practice, we can set L = M - 1. We cannot know the number WT in general; in practice, we can set δ = 0 whenever the length of an episode exceeds WT. If we set L = M - 1 and WT = W0 = W, theorem 1 simplifies to

  δ < 1 / ((M^W - 1)(n - 1)).  (4)
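The bound of Theorem 1 is easy to evaluate; the sketch below (helper names are ours) encodes our reading of the theorem and of equation (4), and checks that the two expressions coincide when L = M - 1 and WT = W0 = W:

```python
def sharing_bound(M, L, W0, WT, n):
    """Upper bound on the indirect-reward rate delta from Theorem 1:
    delta < (M - 1) / (M**W0 * (1 - (1/M)**WT) * (n - 1) * L)."""
    return (M - 1) / (M**W0 * (1 - (1.0 / M)**WT) * (n - 1) * L)

def simplified_bound(M, W, n):
    """Equation (4): the practical bound with L = M - 1 and WT = W0 = W."""
    return 1.0 / ((M**W - 1) * (n - 1))

# Hypothetical example values (ours): two actions, episodes of length 3,
# three agents. The two forms agree, since
# M**W * (1 - M**-W) == M**W - 1 and the (M - 1) factors cancel.
M, W, n = 2, 3, 3
assert abs(sharing_bound(M, M - 1, W, W, n) - simplified_bound(M, W, n)) < 1e-12
print(simplified_bound(M, W, n))  # 1/14, roughly 0.0714
```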
Fig. 8. Roulette-like environments used in the numerical examples.

4 Numerical Examples

4.1 Environments

Consider the roulette-like environments in figure 8. There are 3 and 4 learning agents in roulettes a) and b), respectively. The initial state of agent i (A_i) is S_i. The number shown in the center of the roulette is given to each agent as a sensory input. Each agent has two actions: move right (20% failure) or move left (50% failure). If an action fails, the agent cannot move. There is no situation where another agent gets the same sensory input. At each discrete time step, A_i is selected based on the selection probabilities P_i (P_i > 0, Σ_i P_i = 1); (P0, P1, P2) is (0.9, 0.05, 0.05) for roulette a), and (P0, ..., P4) is (0.72, 0.04, 0.04, ..., 0.2) for roulette b). When A_i reaches its goal G_i, P_i is set to 0 and the P_j (j ≠ i) are rescaled proportionally. When A_R reaches G_R on condition that all the other A_i (i ≠ R) have already reached their G_i, the direct reward R (= 1.0) is given to A_R and the indirect reward δR is given to the other agents. When some agent gets the direct reward, or some A_i reaches a G_j (j ≠ i), all agents return to the initial states shown in figure 8. All rules start from the same initial weight.

If every agent learns the policy `move right for any sensory input', the joint behavior is optimal. If at least two agents learn the policy `move left in the initial state', or `move right in the initial state and move left on the right side of the initial state', the behavior is irrational. When the optimal policy, once acquired, is not destroyed in the remaining episodes, the learning is judged to be successful. We stop the learning if agents 0, 1 and 2 learn the policy `move left in the initial state', or if the total number of actions grows too large. Initially, we set WT = 3; if the length of an episode exceeds WT, we set δ = 0 (see section 3.3). From equation (4), we set δ < 0.0714... for roulette a) to preserve the rationality condition; the bound for roulette b) is correspondingly smaller.
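The movement rule of these environments can be sketched as follows; a hypothetical step function (the ring topology and the number of positions are our assumptions, suggested but not fixed by the figure), with the stated failure rates of 20% for moving right and 50% for moving left:

```python
import random

def step(pos, action, n_pos, rng):
    """One move on a ring of n_pos sensory inputs.

    action: +1 = move right (fails with probability 0.2),
            -1 = move left  (fails with probability 0.5).
    A failed action leaves the agent where it is.
    """
    fail = 0.2 if action == +1 else 0.5  # failure rates from the text
    if rng.random() < fail:
        return pos
    return (pos + action) % n_pos        # ring topology is our assumption

rng = random.Random(1)
pos = 0
for _ in range(5):
    pos = step(pos, +1, 9, rng)  # five attempts to move right
```

The asymmetric failure rates are what make `always move right' the optimal policy: both actions eventually reach a goal, but moving right succeeds more often per attempt.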
4.2 Results

We investigate the learning quality and speed in both roulettes. The results are shown in Table 1; Tables 1a and 1b correspond to roulettes a) and b), respectively.

Table 1. The learning qualities and speeds in roulettes a) and b) for each δ: the numbers of irrational and optimal policies acquired, and the average (Ave.) and standard deviation (S.D.) of the learning speeds.

Fig. 9. Details of the learning speeds in roulettes a) and b), with the upper bound of Theorem 1 marked.

The learning quality is evaluated by the number of times irrational or optimal policies are acquired in a thousand trials with different random seeds. The learning speed is evaluated by the total number of actions needed to learn a thousand optimal
policies. Figures 9a and 9b show the details of the learning speeds in roulettes a) and b), respectively.

Though theorem 1 guarantees rationality, it does not guarantee optimality. Nevertheless, in both roulettes, the optimal policy was always learned even somewhat beyond the range of theorem 1. In roulette a), δ = 0.1 makes the learning speed the best (Tab. 1a, Fig. 9a). On the other hand, if we set δ ≥ 0.4, there are cases in which irrational policies are learned. For example, consider the case where A0, A1 and A2 in roulette a) obtain the three rule sequences shown in figure 10. In this case, with such a large δ, A0, A1 and A2 approach G2, G0 and G1, respectively.

Fig. 10. An example of rule sequences in roulette a).

If we set δ < 0.0714..., such irrational policies are not learned, and the indirect reward still improves the learning speed. Though it is possible to improve the learning speed beyond the range of theorem 1, we should respect the theorem in order to guarantee rationality in all environments.
In roulette b), A3 cannot learn anything because there is no G3. Therefore, if we set δ = 0, the optimal policy is never learned (Tab. 1b); in this case, we must use the indirect reward. Table 1b and figure 9b show that δ = 0.02 makes the learning speed the best. On the other hand, if δ is set too large, there are again cases in which irrational policies are learned. It is an important property of the indirect reward that the learning quality can exceed that of the case δ = 0. Though theorem 1 only guarantees rationality, the numerical examples show that it is also possible to improve learning speed and quality.

5 Conclusions

In most multi-agent reinforcement learning systems, reinforcement learning methods designed for single-agent systems are used. Though it is important to share a reward among all agents in multi-agent reinforcement learning systems, conventional work has used ad hoc sharing schemes.

In this paper, we focused on the Rationality Theorem of Profit Sharing and analyzed how to share a reward among all profit sharing agents. We showed the necessary and sufficient condition to preserve the rationality condition in multi-agent reinforcement learning systems. Using this theorem, we can enjoy multiple beneficial aspects of reward sharing, including improvement of learning speed and quality, without the least desirable situation in which the expected reward per action is zero.

Our future projects include: 1) analyzing the improvement effects on learning speed and quality, 2) extending the result to other fields of reinforcement learning, and 3) finding efficient real-world applications.

Appendix

A Proof of Lemma 1

Fig. 11. Conflict structures used in the proof: a (b=1, c=1), b (b=2, c=1), c (b=2, c=2), d (b=2, c=2), e (b=3, c=1), f (b=3, c=2), g (b=3, c=3).

Reinforcement of an irrational rule makes it difficult to preserve the rationality condition under any δ. Therefore, the difficulty of a conflict structure is monotonic in the number of reinforcements of its irrational rules. We enumerate conflict
structures according to the branching factor b (the number of state transitions from the same sensory input), the conflict factor c (the number of conflicting rules there), and the count of reinforcements of irrational rules. Though we set L = 1 here, the argument extends easily to any L.

b = 1: This is clearly not difficult, since there are no conflicts (Fig. 11a).

b = 2: When there are no conflicts (Fig. 11b), it is the same as b = 1. We divide the structures with c = 2 into two subcases: one contains a self-loop (Fig. 11c) and the other does not (Fig. 11d). In the case of figure 11c, the self-loop rule may be selected repeatedly, while the non-self-loop rule is selected at most once. Therefore, if the self-loop rule is irrational, it is reinforced more than the irrational rule of figure 11d.

b = 3: When there are no conflicts (Fig. 11e), it is the same as b = 1. Consider the structure with c = 2 (Fig. 11f). The most difficult case is again the one whose irrational rule is a self-loop, but even that structure is less difficult than figure 11c. In the structure with c = 3 (Fig. 11g), two conflicting rules are irrational because L = 1; therefore the expected number of reinforcements of each irrational rule is less than in figure 11f. Similarly, conflict structures with b > 3 are less difficult than figure 11c.

From the above discussion, we conclude that the most difficult conflict structure is that of figure 11c. Q.E.D.

B Proof of Lemma 2

Fig. 12. The most difficult multi-agent structure: in each agent i, the L rational rules r_i1^k, ..., r_iL^k conflict with the irrational rule r_i0^k.

For any reinforcement interval k (k = 0, 1, ..., W0) in some goal-agent set, we show that there is a j (j = 1, 2, ..., L) satisfying the condition

  S_ij^k > S_i0^k,  (5)

where S_ij^k is the weight of the j-th rational rule r_ij^k of agent i and S_i0^k is the weight of the conflicting irrational rule r_i0^k (Fig. 12). First, we consider the ratio of the number of selections of r_ij^k to that of r_i0^k. When n0 = n and the L rational rules of each agent are selected by all agents in turn, the minimum of this ratio is maximized (Tab. 2).
In this case, the following ratio holds (Fig. 13):

  r_ij^k : r_i0^k = 1 : (n - 1)L.  (6)
Table 2. An example of a rule sequence for all agents at some k. If a cross changes to a circle, or a double circle changes to a circle, in this table, learning in the agent that can select the changed rule occurs more easily.

Fig. 13. An example of rule sequences: sequence 1 is learned more easily than sequence 2.

Second, we consider the weights given to r_ij^k and r_i0^k. When the agent that gets the direct reward senses no identical sensory input within W0, the weight given to r_ij^k is minimized; it is R/M^W0, attained at k = W0. On the other hand, when the agents that get the indirect reward sense the same sensory input throughout WT, the weight given to r_i0^k is maximized; accumulated over WT (≥ W0), it is δR (1 - (1/M)^WT) / (M - 1). Therefore, condition (5) holds if and only if

  R / M^W0 > δR ((1 - (1/M)^WT) / (M - 1)) (n - 1) L,  (7)

that is,

  δ < (M-1) / (M^W0 (1 - (1/M)^WT) (n-1) L).  (8)
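The weight comparison behind condition (7) can be checked numerically; a sketch under our reconstruction of the formulas (all helper names are ours):

```python
def min_direct_weight(R, M, W0):
    """Smallest reinforcement a rational rule receives from the direct
    reward: R / M**W0, at the far end of the episode (k = W0)."""
    return R / M**W0

def max_indirect_weight(R, delta, M, WT):
    """Largest total an irrational rule can accumulate from one indirect
    reward: delta*R * sum_{i=1..WT} (1/M)**i
          = delta*R * (1 - (1/M)**WT) / (M - 1)  (geometric series)."""
    return delta * R * (1 - (1.0 / M)**WT) / (M - 1)

def rationality_preserved(R, delta, M, L, W0, WT, n):
    """Condition (7): the worst-case rational rule must still outweigh the
    worst-case irrational rule, selected (n-1)*L times as often."""
    return min_direct_weight(R, M, W0) > \
        max_indirect_weight(R, delta, M, WT) * (n - 1) * L

# Hypothetical example values (ours): the condition flips exactly at the
# bound of Theorem 1, as the derivation of (8) predicts.
R, M, L, W0, WT, n = 1.0, 2, 1, 3, 3, 3
bound = (M - 1) / (M**W0 * (1 - (1.0 / M)**WT) * (n - 1) * L)
assert rationality_preserved(R, 0.99 * bound, M, L, W0, WT, n)
assert not rationality_preserved(R, 1.01 * bound, M, L, W0, WT, n)
```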
This is clearly also the sufficient condition. Q.E.D.

References

1. Arai, S., Miyazaki, K., and Kobayashi, S.: Generating Cooperative Behavior by Multi-Agent Reinforcement Learning, Proc. of the 6th European Workshop on Learning Robots (1997).
2. Arai, S., Miyazaki, K., and Kobayashi, S.: Cranes Control Using Multi-agent Reinforcement Learning, International Conference on Intelligent Autonomous Systems 5 (1998).
3. Grefenstette, J. J.: Credit Assignment in Rule Discovery Systems Based on Genetic Algorithms, Machine Learning, Vol. 3, pp. 225-245 (1988).
4. Holland, J. H.: Escaping Brittleness: The Possibilities of General-Purpose Learning Algorithms Applied to Parallel Rule-Based Systems, in R. S. Michalski et al. (eds.), Machine Learning: An Artificial Intelligence Approach, Vol. 2, Morgan Kaufmann (1986).
5. Miyazaki, K., Yamamura, M., and Kobayashi, S.: On the Rationality of Profit Sharing in Reinforcement Learning, Proc. of the 3rd International Conference on Fuzzy Logic, Neural Nets and Soft Computing, Iizuka, Japan (1994).
6. Miyazaki, K., and Kobayashi, S.: Learning Deterministic Policies in Partially Observable Markov Decision Processes, International Conference on Intelligent Autonomous Systems 5 (1998).
7. Ono, N., Ikeda, O., and Rahmani, A. T.: Synthesis of Herding and Specialized Behavior by Modular Q-learning Animats, Proc. of the ALIFE V Poster Presentations (1996).
8. Sen, S., and Sekaran, M.: Multiagent Coordination with Learning Classifier Systems, in Weiss, G., and Sen, S. (eds.), Adaption and Learning in Multi-agent Systems, Springer-Verlag, Berlin, Heidelberg, pp. 218-233 (1995).
9. Tan, M.: Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents, Proc. of the 10th International Conference on Machine Learning, pp. 330-337 (1993).
10. Watkins, C. J. C. H., and Dayan, P.: Technical Note: Q-learning, Machine Learning, Vol. 8, pp. 279-292 (1992).
11. Weiss, G.: Learning to Coordinate Actions in Multi-Agent Systems, Proc. of the 13th International Joint Conference on Artificial Intelligence, pp. 311-316 (1993).
12. Whitehead, S. D., and Ballard, D. H.: Active Perception and Reinforcement Learning, Proc. of the 7th International Conference on Machine Learning, pp. 162-169 (1990).

This article was processed using the LaTeX macro package with LLNCS style.
More informationMarkov Decision Processes
Markov Decision Processes Noel Welsh 11 November 2010 Noel Welsh () Markov Decision Processes 11 November 2010 1 / 30 Annoucements Applicant visitor day seeks robot demonstrators for exciting half hour
More informationShort Course: Multiagent Systems. Multiagent Systems. Lecture 1: Basics Agents Environments. Reinforcement Learning. This course is about:
Short Course: Multiagent Systems Lecture 1: Basics Agents Environments Reinforcement Learning Multiagent Systems This course is about: Agents: Sensing, reasoning, acting Multiagent Systems: Distributed
More informationThe size of decision table can be understood in terms of both cardinality of A, denoted by card (A), and the number of equivalence classes of IND (A),
Attribute Set Decomposition of Decision Tables Dominik Slezak Warsaw University Banacha 2, 02-097 Warsaw Phone: +48 (22) 658-34-49 Fax: +48 (22) 658-34-48 Email: slezak@alfa.mimuw.edu.pl ABSTRACT: Approach
More informationLocal Linear Controllers. DP-based algorithms will find optimal policies. optimality and computational expense, and the gradient
Efficient Non-Linear Control by Combining Q-learning with Local Linear Controllers Hajime Kimura Λ Tokyo Institute of Technology gen@fe.dis.titech.ac.jp Shigenobu Kobayashi Tokyo Institute of Technology
More informationWojciech Penczek. Polish Academy of Sciences, Warsaw, Poland. and. Institute of Informatics, Siedlce, Poland.
A local approach to modal logic for multi-agent systems? Wojciech Penczek 1 Institute of Computer Science Polish Academy of Sciences, Warsaw, Poland and 2 Akademia Podlaska Institute of Informatics, Siedlce,
More informationCS343 Artificial Intelligence
CS343 Artificial Intelligence Prof: Department of Computer Science The University of Texas at Austin Good Afternoon, Colleagues Good Afternoon, Colleagues Are there any questions? Logistics Problems with
More informationImproving Coordination via Emergent Communication in Cooperative Multiagent Systems: A Genetic Network Programming Approach
Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics San Antonio, TX, USA - October 2009 Improving Coordination via Emergent Communication in Cooperative Multiagent Systems:
More informationBasics of reinforcement learning
Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system
More informationEvolving Behavioral Strategies. in Predators and Prey. Thomas Haynes and Sandip Sen. Department of Mathematical & Computer Sciences
Evolving Behavioral Strategies in Predators and Prey Thomas Haynes and Sandip Sen Department of Mathematical & Computer Sciences The University of Tulsa e{mail: [haynes,sandip]@euler.mcs.utulsa.edu? Abstract.
More informationThis question has three parts, each of which can be answered concisely, but be prepared to explain and justify your concise answer.
This question has three parts, each of which can be answered concisely, but be prepared to explain and justify your concise answer. 1. Suppose you have a policy and its action-value function, q, then you
More informationMARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti
1 MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti Historical background 2 Original motivation: animal learning Early
More informationA Neuro-Fuzzy Scheme for Integrated Input Fuzzy Set Selection and Optimal Fuzzy Rule Generation for Classification
A Neuro-Fuzzy Scheme for Integrated Input Fuzzy Set Selection and Optimal Fuzzy Rule Generation for Classification Santanu Sen 1 and Tandra Pal 2 1 Tejas Networks India Ltd., Bangalore - 560078, India
More informationComparison of Evolutionary Methods for. Smoother Evolution. 2-4 Hikaridai, Seika-cho. Soraku-gun, Kyoto , JAPAN
Comparison of Evolutionary Methods for Smoother Evolution Tomofumi Hikage, Hitoshi Hemmi, and Katsunori Shimohara NTT Communication Science Laboratories 2-4 Hikaridai, Seika-cho Soraku-gun, Kyoto 619-0237,
More informationQUICR-learning for Multi-Agent Coordination
QUICR-learning for Multi-Agent Coordination Adrian K. Agogino UCSC, NASA Ames Research Center Mailstop 269-3 Moffett Field, CA 94035 adrian@email.arc.nasa.gov Kagan Tumer NASA Ames Research Center Mailstop
More informationPerceptron. (c) Marcin Sydow. Summary. Perceptron
Topics covered by this lecture: Neuron and its properties Mathematical model of neuron: as a classier ' Learning Rule (Delta Rule) Neuron Human neural system has been a natural source of inspiration for
More informationMarkov Models and Reinforcement Learning. Stephen G. Ware CSCI 4525 / 5525
Markov Models and Reinforcement Learning Stephen G. Ware CSCI 4525 / 5525 Camera Vacuum World (CVW) 2 discrete rooms with cameras that detect dirt. A mobile robot with a vacuum. The goal is to ensure both
More informationPlanning in Markov Decision Processes
Carnegie Mellon School of Computer Science Deep Reinforcement Learning and Control Planning in Markov Decision Processes Lecture 3, CMU 10703 Katerina Fragkiadaki Markov Decision Process (MDP) A Markov
More informationA Preference Semantics. for Ground Nonmonotonic Modal Logics. logics, a family of nonmonotonic modal logics obtained by means of a
A Preference Semantics for Ground Nonmonotonic Modal Logics Daniele Nardi and Riccardo Rosati Dipartimento di Informatica e Sistemistica, Universita di Roma \La Sapienza", Via Salaria 113, I-00198 Roma,
More informationReinforcement Learning II
Reinforcement Learning II Andrea Bonarini Artificial Intelligence and Robotics Lab Department of Electronics and Information Politecnico di Milano E-mail: bonarini@elet.polimi.it URL:http://www.dei.polimi.it/people/bonarini
More informationAn Adaptive Clustering Method for Model-free Reinforcement Learning
An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at
More informationAbstract: Q-learning as well as other learning paradigms depend strongly on the representation
Ecient Q-Learning by Division of Labor Michael Herrmann RIKEN, Lab. f. Information Representation 2-1 Hirosawa, Wakoshi 351-01, Japan michael@sponge.riken.go.jp Ralf Der Universitat Leipzig, Inst. f. Informatik
More information0 o 1 i B C D 0/1 0/ /1
A Comparison of Dominance Mechanisms and Simple Mutation on Non-Stationary Problems Jonathan Lewis,? Emma Hart, Graeme Ritchie Department of Articial Intelligence, University of Edinburgh, Edinburgh EH
More informationReinforcement Learning
1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision
More informationReinforcement Learning and Deep Reinforcement Learning
Reinforcement Learning and Deep Reinforcement Learning Ashis Kumer Biswas, Ph.D. ashis.biswas@ucdenver.edu Deep Learning November 5, 2018 1 / 64 Outlines 1 Principles of Reinforcement Learning 2 The Q
More informationLearning Without State-Estimation in Partially Observable. Tommi Jaakkola. Department of Brain and Cognitive Sciences (E10)
Learning Without State-Estimation in Partially Observable Markovian Decision Processes Satinder P. Singh singh@psyche.mit.edu Tommi Jaakkola tommi@psyche.mit.edu Michael I. Jordan jordan@psyche.mit.edu
More informationCoarticulation in Markov Decision Processes
Coarticulation in Markov Decision Processes Khashayar Rohanimanesh Department of Computer Science University of Massachusetts Amherst, MA 01003 khash@cs.umass.edu Sridhar Mahadevan Department of Computer
More informationMultiagent Systems Motivation. Multiagent Systems Terminology Basics Shapley value Representation. 10.
Multiagent Systems July 2, 2014 10. Coalition Formation Multiagent Systems 10. Coalition Formation B. Nebel, C. Becker-Asano, S. Wöl Albert-Ludwigs-Universität Freiburg July 2, 2014 10.1 Motivation 10.2
More informationA Residual Gradient Fuzzy Reinforcement Learning Algorithm for Differential Games
International Journal of Fuzzy Systems manuscript (will be inserted by the editor) A Residual Gradient Fuzzy Reinforcement Learning Algorithm for Differential Games Mostafa D Awheda Howard M Schwartz Received:
More informationΦ 12 Φ 21 Φ 22 Φ L2 Φ LL Φ L1
Self-organiing Symbol Acquisition and Motion Generation based on Dynamics-based Information Processing System Masafumi Okada, Daisuke Nakamura 2 and Yoshihiko Nakamura 2 Dept. of Mechanical Science and
More informationDual Memory Model for Using Pre-Existing Knowledge in Reinforcement Learning Tasks
Dual Memory Model for Using Pre-Existing Knowledge in Reinforcement Learning Tasks Kary Främling Helsinki University of Technology, PL 55, FI-25 TKK, Finland Kary.Framling@hut.fi Abstract. Reinforcement
More informationCoarticulation in Markov Decision Processes Khashayar Rohanimanesh Robert Platt Sridhar Mahadevan Roderic Grupen CMPSCI Technical Report 04-33
Coarticulation in Markov Decision Processes Khashayar Rohanimanesh Robert Platt Sridhar Mahadevan Roderic Grupen CMPSCI Technical Report 04- June, 2004 Department of Computer Science University of Massachusetts
More informationZero-Sum Stochastic Games An algorithmic review
Zero-Sum Stochastic Games An algorithmic review Emmanuel Hyon LIP6/Paris Nanterre with N Yemele and L Perrotin Rosario November 2017 Final Meeting Dygame Dygame Project Amstic Outline 1 Introduction Static
More informationξ, ξ ξ Data number ξ 1
Polynomial Design of Dynamics-based Information Processing System Masafumi OKADA and Yoshihiko NAKAMURA Univ. of Tokyo, 7-- Hongo Bunkyo-ku, JAPAN Abstract. For the development of the intelligent robot
More informationLearning Communication for Multi-agent Systems
Learning Communication for Multi-agent Systems C. Lee Giles 1,2 and Kam-Chuen Jim 2 1 School of Information Sciences & Technology and Computer Science and Engineering The Pennsylvania State University
More informationPlanning by Probabilistic Inference
Planning by Probabilistic Inference Hagai Attias Microsoft Research 1 Microsoft Way Redmond, WA 98052 Abstract This paper presents and demonstrates a new approach to the problem of planning under uncertainty.
More informationA Gentle Introduction to Reinforcement Learning
A Gentle Introduction to Reinforcement Learning Alexander Jung 2018 1 Introduction and Motivation Consider the cleaning robot Rumba which has to clean the office room B329. In order to keep things simple,
More informationStatistical Analysis of Backtracking on. Inconsistent CSPs? Irina Rish and Daniel Frost.
Statistical Analysis of Backtracking on Inconsistent CSPs Irina Rish and Daniel Frost Department of Information and Computer Science University of California, Irvine, CA 92697-3425 firinar,frostg@ics.uci.edu
More informationCS599 Lecture 1 Introduction To RL
CS599 Lecture 1 Introduction To RL Reinforcement Learning Introduction Learning from rewards Policies Value Functions Rewards Models of the Environment Exploitation vs. Exploration Dynamic Programming
More informationCourse 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016
Course 16:198:520: Introduction To Artificial Intelligence Lecture 13 Decision Making Abdeslam Boularias Wednesday, December 7, 2016 1 / 45 Overview We consider probabilistic temporal models where the
More informationCS 7180: Behavioral Modeling and Decisionmaking
CS 7180: Behavioral Modeling and Decisionmaking in AI Markov Decision Processes for Complex Decisionmaking Prof. Amy Sliva October 17, 2012 Decisions are nondeterministic In many situations, behavior and
More informationLearning Delayed Influence of Dynamical Systems From Interpretation Transition
Learning Delayed Influence of Dynamical Systems From Interpretation Transition Tony Ribeiro 1, Morgan Magnin 2,3, and Katsumi Inoue 1,2 1 The Graduate University for Advanced Studies (Sokendai), 2-1-2
More informationA Cooperation Strategy for Decision-making Agents
A Cooperation Strategy for Decision-making Agents Eduardo Camponogara Institute for Complex Engineered Systems Carnegie Mellon University Pittsburgh, PA 523-389 camponog+@cs.cmu.edu Abstract The style
More informationLearning Conditional Probabilities from Incomplete Data: An Experimental Comparison Marco Ramoni Knowledge Media Institute Paola Sebastiani Statistics
Learning Conditional Probabilities from Incomplete Data: An Experimental Comparison Marco Ramoni Knowledge Media Institute Paola Sebastiani Statistics Department Abstract This paper compares three methods
More informationMarkov decision processes
CS 2740 Knowledge representation Lecture 24 Markov decision processes Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Administrative announcements Final exam: Monday, December 8, 2008 In-class Only
More informationDecision Theory: Markov Decision Processes
Decision Theory: Markov Decision Processes CPSC 322 Lecture 33 March 31, 2006 Textbook 12.5 Decision Theory: Markov Decision Processes CPSC 322 Lecture 33, Slide 1 Lecture Overview Recap Rewards and Policies
More informationProf. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be
REINFORCEMENT LEARNING AN INTRODUCTION Prof. Dr. Ann Nowé Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING WHAT IS IT? What is it? Learning from interaction Learning about, from, and while
More informationExponential Moving Average Based Multiagent Reinforcement Learning Algorithms
Artificial Intelligence Review manuscript No. (will be inserted by the editor) Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Mostafa D. Awheda Howard M. Schwartz Received:
More informationGrundlagen der Künstlichen Intelligenz
Grundlagen der Künstlichen Intelligenz Formal models of interaction Daniel Hennes 27.11.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Taxonomy of domains Models of
More informationMultiagent (Deep) Reinforcement Learning
Multiagent (Deep) Reinforcement Learning MARTIN PILÁT (MARTIN.PILAT@MFF.CUNI.CZ) Reinforcement learning The agent needs to learn to perform tasks in environment No prior knowledge about the effects of
More informationBest-Response Multiagent Learning in Non-Stationary Environments
Best-Response Multiagent Learning in Non-Stationary Environments Michael Weinberg Jeffrey S. Rosenschein School of Engineering and Computer Science Heew University Jerusalem, Israel fmwmw,jeffg@cs.huji.ac.il
More informationDynamics in the dynamic walk of a quadruped robot. Hiroshi Kimura. University of Tohoku. Aramaki Aoba, Aoba-ku, Sendai 980, Japan
Dynamics in the dynamic walk of a quadruped robot Hiroshi Kimura Department of Mechanical Engineering II University of Tohoku Aramaki Aoba, Aoba-ku, Sendai 980, Japan Isao Shimoyama and Hirofumi Miura
More informationLearning to Coordinate Efficiently: A Model-based Approach
Journal of Artificial Intelligence Research 19 (2003) 11-23 Submitted 10/02; published 7/03 Learning to Coordinate Efficiently: A Model-based Approach Ronen I. Brafman Computer Science Department Ben-Gurion
More informationPrioritized Sweeping Converges to the Optimal Value Function
Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science
More informationThe Reinforcement Learning Problem
The Reinforcement Learning Problem Slides based on the book Reinforcement Learning by Sutton and Barto Formalizing Reinforcement Learning Formally, the agent and environment interact at each of a sequence
More informationDempster's Rule of Combination is. #P -complete. Pekka Orponen. Department of Computer Science, University of Helsinki
Dempster's Rule of Combination is #P -complete Pekka Orponen Department of Computer Science, University of Helsinki eollisuuskatu 23, SF{00510 Helsinki, Finland Abstract We consider the complexity of combining
More informationCOMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning. Hanna Kurniawati
COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning Hanna Kurniawati Today } What is machine learning? } Where is it used? } Types of machine learning
More informationLecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan
COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan Some slides borrowed from Peter Bodik and David Silver Course progress Learning
More information15-780: Graduate Artificial Intelligence. Reinforcement learning (RL)
15-780: Graduate Artificial Intelligence Reinforcement learning (RL) From MDPs to RL We still use the same Markov model with rewards and actions But there are a few differences: 1. We do not assume we
More informationLaboratory for Computer Science. Abstract. We examine the problem of nding a good expert from a sequence of experts. Each expert
Picking the Best Expert from a Sequence Ruth Bergman Laboratory for Articial Intelligence Massachusetts Institute of Technology Cambridge, MA 0239 Ronald L. Rivest y Laboratory for Computer Science Massachusetts
More informationCS 4649/7649 Robot Intelligence: Planning
CS 4649/7649 Robot Intelligence: Planning Probability Primer Sungmoon Joo School of Interactive Computing College of Computing Georgia Institute of Technology S. Joo (sungmoon.joo@cc.gatech.edu) 1 *Slides
More informationOn Prediction and Planning in Partially Observable Markov Decision Processes with Large Observation Sets
On Prediction and Planning in Partially Observable Markov Decision Processes with Large Observation Sets Pablo Samuel Castro pcastr@cs.mcgill.ca McGill University Joint work with: Doina Precup and Prakash
More informationThe Markov Decision Process Extraction Network
The Markov Decision Process Extraction Network Siegmund Duell 1,2, Alexander Hans 1,3, and Steffen Udluft 1 1- Siemens AG, Corporate Research and Technologies, Learning Systems, Otto-Hahn-Ring 6, D-81739
More informationGeorey J. Gordon. Carnegie Mellon University. Pittsburgh PA Bellman-Ford single-destination shortest paths algorithm
Stable Function Approximation in Dynamic Programming Georey J. Gordon Computer Science Department Carnegie Mellon University Pittsburgh PA 53 ggordon@cs.cmu.edu Abstract The success of reinforcement learning
More informationMachine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?
Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity
More informationRECURSION EQUATION FOR
Math 46 Lecture 8 Infinite Horizon discounted reward problem From the last lecture: The value function of policy u for the infinite horizon problem with discount factor a and initial state i is W i, u
More informationReinforcement Learning and Control
CS9 Lecture notes Andrew Ng Part XIII Reinforcement Learning and Control We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make
More informationA reinforcement learning scheme for a multi-agent card game with Monte Carlo state estimation
A reinforcement learning scheme for a multi-agent card game with Monte Carlo state estimation Hajime Fujita and Shin Ishii, Nara Institute of Science and Technology 8916 5 Takayama, Ikoma, 630 0192 JAPAN
More informationShigetaka Fujita. Rokkodai, Nada, Kobe 657, Japan. Haruhiko Nishimura. Yashiro-cho, Kato-gun, Hyogo , Japan. Abstract
KOBE-TH-94-07 HUIS-94-03 November 1994 An Evolutionary Approach to Associative Memory in Recurrent Neural Networks Shigetaka Fujita Graduate School of Science and Technology Kobe University Rokkodai, Nada,
More information1 Introduction Duality transformations have provided a useful tool for investigating many theories both in the continuum and on the lattice. The term
SWAT/102 U(1) Lattice Gauge theory and its Dual P. K. Coyle a, I. G. Halliday b and P. Suranyi c a Racah Institute of Physics, Hebrew University of Jerusalem, Jerusalem 91904, Israel. b Department ofphysics,
More informationQ-learning. Tambet Matiisen
Q-learning Tambet Matiisen (based on chapter 11.3 of online book Artificial Intelligence, foundations of computational agents by David Poole and Alan Mackworth) Stochastic gradient descent Experience
More informationAbstract. This paper discusses polynomial-time reductions from Hamiltonian Circuit (HC),
SAT-Variable Complexity of Hard Combinatorial Problems Kazuo Iwama and Shuichi Miyazaki Department of Computer Science and Communication Engineering Kyushu University, Hakozaki, Higashi-ku, Fukuoka 812,
More informationFinal. Introduction to Artificial Intelligence. CS 188 Spring You have approximately 2 hours and 50 minutes.
CS 188 Spring 2014 Introduction to Artificial Intelligence Final You have approximately 2 hours and 50 minutes. The exam is closed book, closed notes except your two-page crib sheet. Mark your answers
More informationby the size of state/action space (and by the trial length which maybeproportional to this). Lin's oine Q() (993) creates an action-replay set of expe
Speeding up Q()-learning In Proceedings of the Tenth European Conference on Machine Learning (ECML'98) Marco Wiering and Jurgen Schmidhuber IDSIA, Corso Elvezia 36 CH-69-Lugano, Switzerland Abstract. Q()-learning
More informationThe convergence limit of the temporal difference learning
The convergence limit of the temporal difference learning Ryosuke Nomura the University of Tokyo September 3, 2013 1 Outline Reinforcement Learning Convergence limit Construction of the feature vector
More informationCS 4100 // artificial intelligence. Recap/midterm review!
CS 4100 // artificial intelligence instructor: byron wallace Recap/midterm review! Attribution: many of these slides are modified versions of those distributed with the UC Berkeley CS188 materials Thanks
More informationReinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina
Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it
More informationCombine Monte Carlo with Exhaustive Search: Effective Variational Inference and Policy Gradient Reinforcement Learning
Combine Monte Carlo with Exhaustive Search: Effective Variational Inference and Policy Gradient Reinforcement Learning Michalis K. Titsias Department of Informatics Athens University of Economics and Business
More informationQ-Learning for Markov Decision Processes*
McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan khoa.phan@mail.mcgill.ca Sandeep Manjanna sandeep.manjanna@mail.mcgill.ca (*Based on: Convergence of
More information