Rationality of Reward Sharing in Multi-agent Reinforcement Learning

Kazuteru Miyazaki and Shigenobu Kobayashi
Tokyo Institute of Technology, Department of Computational Intelligence and Systems Science, Interdisciplinary Graduate School of Science and Engineering, 4259 Nagatsuta, Midori-ku, Yokohama 226-8502, JAPAN

Abstract. In multi-agent reinforcement learning systems, it is important to share a reward among all agents. We focus on the Rationality Theorem of Profit Sharing [5] and analyze how to share a reward among all profit sharing agents. When an agent gets a direct reward R (R > 0), an indirect reward σR (σ ≥ 0) is given to the other agents. We have derived the following necessary and sufficient condition to preserve the rationality:

    σ < (M − 1) / ( M^{W_D} (1 − (1/M)^{W_I}) (n − 1) L ),

where M and L are the maximum numbers of conflicting rules and of conflicting rational rules in the same sensory input, W_D and W_I are the maximum episode lengths of a direct-reward and an indirect-reward agent, and n is the number of agents. This theorem is derived by avoiding the least desirable situation, in which the expected reward per action is zero. Therefore, if we use this theorem, we can enjoy several beneficial aspects of reward sharing. Through numerical examples, we confirm the effectiveness of the theorem.

1 Introduction

Achieving cooperation in multi-agent systems is a very desirable goal. In recent years, the bottom-up approach to multi-agent systems, which contrasts remarkably with the top-down approach of DAI (Distributed Artificial Intelligence), has prevailed. Recently, approaches that realize cooperation by reinforcement learning have become notable. There is much literature on multi-agent reinforcement learning [2, 9, 11, 8, 7, 1, 12]. Q-learning [10] and Classifier Systems [4] are used in [2, 9, 7] and [11, 8], respectively. Previous works [1, 2] compare Profit Sharing [3] with Q-learning in the pursuit problem [1] and the cranes control problem [2]. These papers [1, 2] claim that Profit Sharing is suitable for multi-agent reinforcement learning systems.

In multi-agent environments where there is no negative reward, it is important to share a reward among all agents. Conventional work has used ad hoc sharing schemes. Though reward sharing may contribute to improving learning speeds and qualities, it can also damage system behavior.

In particular, it is important to preserve the rationality condition, namely that the expected reward per action is larger than zero.

In this paper, we aim to preserve the rationality condition in multi-agent environments where there is no negative reward. We focus on the Rationality Theorem of Profit Sharing [5] and analyze how to share a reward among all profit sharing agents. We show the necessary and sufficient condition to preserve the rationality condition in multi-agent profit sharing systems. If we use this theorem, we can enjoy several beneficial aspects of reward sharing without falling into the least desirable situation, in which the expected reward per action is zero.

Section 2 describes the problem, the method and the notation. Section 3 presents the necessary and sufficient condition to preserve the rationality condition in multi-agent profit sharing systems. Section 4 shows numerical examples to illustrate the theorem. Section 5 concludes the paper.

2 The Domain

2.1 Problem Formulation

Consider n (n > 1) agents in an unknown environment. At each discrete time step, agent i (i = 1, 2, ..., n) is selected from the n agents based on the selection probabilities P_i (P_i > 0, P_1 + ... + P_n = 1), and it senses the environment and performs an action. An agent senses a set of discrete attribute-value pairs and chooses among M discrete actions. We denote agent i's sensory inputs as x_i, y_i, ... and its actions as a_i, b_i, .... A pair of a sensory input and an action is called a rule. We denote the rule "if x then a" as xa. The function that maps sensory inputs to actions is called a policy. We call a policy rational if and only if the expected reward per action is larger than zero. The policy that maximizes the expected reward per action is called an optimal policy.

When the n_0-th agent (0 < n_0 ≤ n) has a special sensory input on condition that (n_0 − 1) agents have special sensory inputs at that time step, the n_0-th agent gets a direct reward R (R > 0) and the other (n − 1) agents get an indirect reward σR (σ ≥ 0). We call the n_0-th agent the direct-reward agent and the other (n − 1) agents indirect-reward agents. We do not have any information about n_0 and the special sensory inputs. Furthermore, nobody (including the reward designer) knows whether the (n − 1) agents other than the n_0-th agent are important or not. A set of n_0 agents that are necessary for getting a direct reward is called the goal-agent set. In order to preserve the rationality condition that the expected reward per action is larger than zero, all agents in a goal-agent set must learn a rational policy.

We show an example of direct and indirect rewards in a pursuit problem (Fig. 1). There are 6 hunter agents (H0, H1, ..., H5) and one prey agent (Pry). When H0 moves down (Fig. 1a) and 4 hunter agents surround the prey agent as shown in Figure 1b, the direct reward is given to H0 and the indirect reward is given to the other agents (H1, H2, ..., H5).

The number of agents to which the indirect reward is given is 5 (= 6 − 1), because we have no information about n_0 (= 4) and the special sensory inputs.

[Fig. 1. The pursuit problem used to explain direct and indirect rewards.]

Traditionally, in this case, the direct reward would be given to H0, H1, H2 and H3. This presumes that we know which agents are important to catch the prey. However, we cannot always have such information (for example, H5 may be important). Therefore, our formulation is more general than those that have been used previously.

In general, we could consider other multi-agent environments where several agents sense the environment and perform their actions asynchronously. In that case, there are several problems, for example, how to set the reward function and how to decide the priority of the agents' actions. Therefore, we have taken the more basic environment in which one agent senses the environment and performs its action at each discrete time step. Furthermore, we do not use negative rewards and we do not use reward values. Though such rewards are given to learning agents in most reinforcement learning systems, it is difficult to decide appropriate reward values and negative rewards. Therefore, we regard a reward as a good signal only.

2.2 Profit Sharing in Multi-agent Environments

The purpose of this paper is to guarantee the rationality of Profit Sharing (PS) [3] in the multi-agent environments discussed above. When a reward is given to an agent, PS reinforces, at once, all rules on an episode, that is, the sequence of rules selected between rewards. In multi-agent environments, an episode is interpreted by each agent. For example, when the agents select the rule sequence (x1a1, x2a2, y2a2, z3a3, x2a2, y1a1, y2b2, z2b2, x3b3) (Fig. 2) and have the special sensory inputs (for getting a reward), it contains the episodes (x1a1 → y1a1), (x2a2 → y2a2 → x2a2 → y2b2 → z2b2) and (z3a3 → x3b3) for agents 1, 2 and 3, respectively (Fig. 3). In this case, agent 3 gets a direct reward and the other agents get an indirect reward.

We call a subsequence of an episode a detour when the sensory input of the first selected rule and the sensory output of the last selected rule are the same although the two rules are different.

[Fig. 2. A rule sequence in which three agents select actions in some order. x_i, y_i, z_i: sensory inputs of agent i; a_i, b_i: actions of agent i.]

[Fig. 3. The three episodes and one detour contained in Figure 2. Agents 1 and 2 receive the indirect reward and agent 3 receives the direct reward.]

For example, the episode (x2a2 → y2a2 → x2a2 → y2b2 → z2b2) of agent 2 contains the detour (y2a2 → x2a2) (Fig. 3). The rules that always exist on a detour do not contribute to getting a direct reward. We call a rule ineffective if and only if it always exists on a detour; otherwise, a rule is called effective. For example, in the detour (y2a2 → x2a2), y2a2 is an ineffective rule and x2a2 is an effective rule, because x2a2 does not exist on a detour at the first sensory input of the episode (x2a2 → y2a2 → x2a2 → y2b2 → z2b2).

We call a function that shares a reward among the rules on an episode a reinforcement function. The term f_i denotes the reinforcement value for the rule selected i steps before the reward is acquired. The weight S_{r_i} of rule r_i is reinforced by S_{r_i} = S_{r_i} + f_i for an episode (r_{W_a} → ... → r_i → ... → r_1 → r_0), where W_a is the length of the episode, called the reinforcement interval. When a reward f_0 is given to the agent, we use the following reinforcement function, which satisfies the Rationality Theorem of PS [5]:

    f_n = (1/M) f_{n-1},   n = 1, 2, ..., W_a,   (1)

where (f_0, W_a) is (R, W_D) for the direct-reward agent and (σR, W_I) (W_I ≥ W_D) for the indirect-reward agents. For example, in Figure 3, the two rules of agent 1's episode, y1a1 and x1a1, are reinforced by S_{y1a1} = S_{y1a1} + (1/M) σR and S_{x1a1} = S_{x1a1} + (1/M)^2 σR, respectively.
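To make the update of Eq. (1) concrete, the following minimal sketch splits the joint rule sequence of Figure 2 into per-agent episodes and applies the geometric reinforcement function. It is an illustration, not the authors' implementation: the function names, the tuple encoding of rules, the numeric values of R, σ and M, and the choice of giving the newest rule of each episode the value f_1 = f_0/M (as in the agent-1 example above) are our assumptions.

    # Minimal sketch of multi-agent Profit Sharing, assuming rules are encoded as
    # (agent, sensory_input, action) tuples. Not the authors' code; it illustrates Eq. (1).

    def split_into_episodes(joint_sequence):
        """Each agent interprets the joint rule sequence as its own episode."""
        episodes = {}
        for agent, x, a in joint_sequence:
            episodes.setdefault(agent, []).append((x, a))
        return episodes

    def reinforce(weights, episode, f0, M, W_a=None):
        """Apply Eq. (1): the rule selected n steps before the reward gets f_n = (1/M)**n * f0."""
        if W_a is not None:
            episode = episode[-W_a:]          # reinforce at most W_a rules
        f = f0
        for rule in reversed(episode):        # newest rule first, as in the Figure 3 example
            f = f / M
            weights[rule] = weights.get(rule, 0.0) + f

    # The joint sequence of Figure 2; agent 3 selects the last rule before the reward.
    joint = [(1, "x1", "a1"), (2, "x2", "a2"), (2, "y2", "a2"), (3, "z3", "a3"),
             (2, "x2", "a2"), (1, "y1", "a1"), (2, "y2", "b2"), (2, "z2", "b2"), (3, "x3", "b3")]
    episodes = split_into_episodes(joint)

    R, sigma, M = 1.0, 0.05, 2                # illustrative values only
    weights = {agent: {} for agent in episodes}
    for agent, episode in episodes.items():
        f0 = R if agent == 3 else sigma * R   # direct reward for agent 3, indirect for the rest
        reinforce(weights[agent], episode, f0, M)
    print(weights[1])  # y1a1 gets (1/M)*sigma*R and x1a1 gets (1/M)**2*sigma*R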

2.3 Properties of the Target Environments

The paper [2] claims that two main problems should be considered in multi-agent reinforcement learning systems. One is the perceptual aliasing problem [12], which is due to the agents' sensory limitations. The other is the uncertainty of state transition problem, which is due to concurrent learning among the agents [8, 11]. We can treat both problems through two kinds of confusion [6].

We call the indistinction of state values a type 1 confusion. Figure 4a is an example of the type 1 confusion. In this example, the state value (v) is the minimum number of steps to a reward.

[Fig. 4. a) An example of the type 1 confusion. b) An example of the type 2 confusion.]

The values of states 3a and 3b are 2 and 8, respectively. Though 3a and 3b are different states, the agent senses them as the same sensory input (state 3). If the agent visits states 3a and 3b equally, the value of state 3 becomes 5 (= (2 + 8)/2). Therefore the value of state 3 is higher than the value of state 4 (which is 1). If the agent uses state values, it prefers to move left in state 3. However, the agent should move right in state 3. It means that the agent learns an irrational policy in which it only transits between state 3b and the state on its left.

We call the indistinction of rational and irrational rules a type 2 confusion. Figure 4b is an example of the type 2 confusion. Though the action to move up in state 1a is irrational, it is rational in state 1b. Since the agent senses states 1a and 1b as the same sensory input (state 1), the action to move up in state 1 is regarded as rational.

[Fig. 5. Three classes of multi-agent environments, classified by whether type 1 and type 2 confusions exist, and whether PS and QL can learn a rational policy in each class.]

If the agent learns the action to move right in state S, it takes the irrational policy that only transits between states 1a and 2. In general, if there is a type 2 confusion in some sensory input, there is also a type 1 confusion in it.

By these confusions, we can classify multi-agent environments into three classes, as shown in Figure 5. Markov Decision Processes (MDPs), which are treated by many reinforcement learning systems, belong to class 1. Q-learning (QL) [10], which guarantees the acquisition of an optimal policy in MDPs, is deceived by the type 1 confusion since it uses state values to make a policy. PS is not deceived by this confusion since it does not use state values. On the other hand, reinforcement learning systems that use the weights (including QL and PS) are deceived by the type 2 confusion. The Rationality Theorem of PS [5] guarantees the acquisition of a rational policy in the classes where there is no type 2 confusion. Since we use the theorem, we must assume these classes. Remark that they still contain a part of the two main problems (perceptual aliasing and uncertainty of state transition) of multi-agent reinforcement learning systems. For example, an environment in which positive state transition probabilities never fall to zero does not have any type 2 confusion, even if perceptual aliasing and uncertainty of state transition are present. Therefore, our target environments are meaningful as multi-agent environments. In the next section, we extend the Rationality Theorem of PS to these multi-agent environments.

3 Rationality Theorem in Multi-agent Reinforcement Learning

3.1 The Basic Idea

In this section, we derive the necessary and sufficient condition to preserve the rationality condition in the multi-agent profit sharing systems discussed in the previous section. We call the effective rules that will be learned by the Rationality Theorem of PS rational rules, and the other rules irrational rules. We show the relationship of these rules in Figure 6.

[Fig. 6. The relationship of rational, irrational, effective and ineffective rules for σ = 0 and σ > 0.]

Irrational rules should not be reinforced when they conflict with rational rules. When σ > 0, some irrational rules might be judged to be effective rules. Therefore, it is important to suppress all irrational rules among the effective rules.

In order to preserve the rationality condition, all irrational rules in a goal-agent set must be suppressed. Conversely, if a goal-agent set is constructed from agents in which all irrational rules have been suppressed, we can preserve the rationality condition. Therefore, we derive the necessary and sufficient condition on the range of σ that suppresses all irrational rules in some goal-agent set.

First, we characterize the conflict structure in which it is most difficult to suppress irrational rules. For two conflict structures A and B, we say A is more difficult than B when the range of σ that can suppress any irrational rule of A is included in that of B. Second, we derive a necessary and sufficient condition on the range of σ that suppresses any irrational rule in the most difficult conflict structure. Last, it is extended to any conflict structure.

3.2 Proposal of the Rationality Theorem in Multi-agent Reinforcement Learning

Lemma 1 (The most difficult conflict structure). The most difficult conflict structure has only one irrational rule, with a self-loop.

The proof is shown in Appendix A. Figure 7 shows the most difficult conflict structure, in which only one irrational rule with a self-loop conflicts with L rational rules.

Lemma 2 (Suppressing only one irrational rule with a self-loop). Only one irrational rule with a self-loop in some goal-agent set can be suppressed if and only if

    σ < (M − 1) / ( M^{W_D} (1 − (1/M)^{W_I}) (n − 1) L ),   (2)

where M is the maximum number of conflicting rules in the same sensory input, L is the maximum number of conflicting rational rules, W_D is the maximum episode length of a direct-reward agent, W_I is the reinforcement interval of the indirect-reward agents, and n is the number of agents.

[Fig. 7. The most difficult conflict structure: one irrational rule with a self-loop conflicts with L rational rules in the same sensory input.]

The proof is shown in Appendix B. By transitivity, the following theorem is directly derived from these two lemmas.

Theorem 1 (Rationality theorem in multi-agent reinforcement learning). Any irrational rule in some goal-agent set can be suppressed if and only if

    σ < (M − 1) / ( M^{W_D} (1 − (1/M)^{W_I}) (n − 1) L ),   (3)

where M is the maximum number of conflicting rules in the same sensory input, L is the maximum number of conflicting rational rules, W_D is the maximum episode length of a direct-reward agent, W_I is the reinforcement interval of the indirect-reward agents, and n is the number of agents.

3.3 The Meaning of Theorem 1

Theorem 1 is derived by avoiding the least desirable situation, in which the expected reward per action is zero. Therefore, if we use this theorem, we can enjoy several beneficial aspects of indirect rewards, including improvement of learning speeds and qualities.

We cannot know the value of L in general. However, in practice, we can set L = M − 1. We cannot know the value of W_D in general either. However, in practice, we can set σ = 0 whenever the length of an episode becomes larger than the assumed W_D. If we set L = M − 1 and W_I = W_D, Theorem 1 is simplified as follows:

    σ < 1 / ( (M^{W_D} − 1) (n − 1) ).   (4)
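The following short sketch evaluates the bound of Theorem 1 and its simplified form (4); the function names and the example parameter values are ours, not from the paper.

    # Sketch of the sharing-ratio bound of Theorem 1 (Eq. (3)) and of Eq. (4).

    def sigma_bound(M, W_D, W_I, n, L):
        """Right-hand side of Eq. (3)."""
        return (M - 1) / (M**W_D * (1.0 - (1.0 / M)**W_I) * (n - 1) * L)

    def sigma_bound_simplified(M, W_D, n):
        """Right-hand side of Eq. (4), obtained from Eq. (3) with L = M - 1 and W_I = W_D."""
        return 1.0 / ((M**W_D - 1) * (n - 1))

    # The two expressions coincide whenever L = M - 1 and W_I = W_D (illustrative values):
    assert abs(sigma_bound(M=3, W_D=2, W_I=2, n=4, L=2)
               - sigma_bound_simplified(M=3, W_D=2, n=4)) < 1e-12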

[Fig. 8. The roulette-like environments a) and b) used in the numerical examples.]

4 Numerical Example

4.1 Environments

Consider the roulette-like environments of Figure 8. There are 3 and 4 learning agents in roulettes a) and b), respectively. The initial state of agent i (A_i) is S_i. The number shown in the center of each roulette is given to each agent as its sensory input. There are two actions for each agent: move right (2% failure) or move left (5% failure). If an action fails, the agent cannot move. There is no situation where two agents get the same sensory input. At each discrete time step, A_i is selected based on the selection probabilities P_i (P_i > 0, ΣP_i = 1); (P_0, P_1, P_2) is (0.9, 0.05, 0.05) for roulette a), and (P_0, P_1, P_2, P_3) is (0.72, 0.04, 0.04, 0.2) for roulette b). When A_i reaches its goal G_i, P_i is set to 0 and the P_j (j ≠ i) are modified proportionally. When A_R reaches G_R on condition that all the other agents A_i (i ≠ R) have reached their goals G_i, the direct reward R (= 1.0) is given to A_R and the indirect reward σR is given to the other agents. When some agent gets the direct reward, or some A_i reaches a goal G_j (j ≠ i), all agents return to the initial states shown in Figure 8. The initial weights of all rules are 0.0.

If all agents learn the policy "move right in any sensory input", the joint policy is optimal. If at least two agents learn the policy "move left in the initial state" or "move right in the initial state, and move left in the state on the right side of the initial state", it is irrational. When an optimal policy has not been destroyed for a fixed number of episodes, the learning is judged to be successful. We stop the learning if agents 0, 1 and 2 learn the policy "move left in the initial state" or the number of actions exceeds a fixed limit (thousands of actions). Initially, we set W_D = 3. If the length of an episode becomes larger than 3, we set σ = 0 (see Section 3.3). From equation (4), we set σ < 0.0714... for roulette a), and correspondingly for roulette b), to preserve the rationality condition.
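As a quick check of this setting (a sketch under the stated assumptions of M = 2 actions, W_D = 3 and n = 3 for roulette a)), the simplified bound (4) can be evaluated directly:

    # Simplified bound of Eq. (4) for roulette a): M = 2, W_D = 3, n = 3.
    M, W_D, n = 2, 3, 3
    print(1.0 / ((M**W_D - 1) * (n - 1)))   # 0.0714..., the value quoted above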

4.2 Results

We investigated the learning qualities and speeds in both roulettes; they are shown in Table 1. Tables 1a and 1b correspond to roulettes a) and b), respectively.

[Table 1. The learning qualities (numbers of irrational and optimal policies acquired) and the learning speeds (average and standard deviation) in roulettes a) and b), for several values of σ.]

[Fig. 9. Details of the learning speeds in roulettes a) and b); the upper bound of Theorem 1 is marked in each plot.]

The learning qualities are evaluated by the number of times irrational or optimal policies are acquired in a thousand different trials with different random seeds. The learning speeds are evaluated by the total number of actions needed to learn a thousand optimal policies.

Figures 9a and 9b give details of the learning speeds in roulettes a) and b), respectively. Though Theorem 1 guarantees the rationality, it does not guarantee optimality. However, in both roulettes, the optimal policy was always learned even beyond the range of Theorem 1.

In roulette a), a suitably chosen σ gives the best learning speed (Table 1a, Fig. 9a). On the other hand, if σ is set even larger, there are cases in which irrational policies are learned. For example, consider the case where A_0, A_1 and A_2 in roulette a) obtain the three rule sequences of Figure 10. In this case, with such a large σ, A_0, A_1 and A_2 each approach another agent's goal (A_0, for example, approaches G_2).

[Fig. 10. An example of rule sequences in roulette a).]

If we set σ < 0.0714..., such irrational policies are not learned. Furthermore, the learning speeds are improved. Though it is possible to improve the learning speeds beyond the range of Theorem 1, we should keep within the range of the theorem to guarantee the rationality in all environments.

In roulette b), A_3 cannot learn anything because there is no G_3. Therefore, if we set σ = 0, the optimal policy is never learned (Table 1b). In this case, we should use the indirect reward. Table 1b and Figure 9b show that a small positive σ gives the best learning speed. On the other hand, if σ is set larger, there are cases in which irrational policies are learned. It is an important property of the indirect reward that the learning qualities can exceed those of the case σ = 0. Though Theorem 1 only guarantees the rationality, the numerical examples show that it is also possible to improve the learning speeds and qualities.

5 Conclusions

In most multi-agent reinforcement learning systems, reinforcement learning methods designed for single-agent systems are used. Though it is important to share a reward among all agents in multi-agent reinforcement learning systems, conventional work has used ad hoc sharing schemes.

In this paper, we focused on the Rationality Theorem of Profit Sharing and analyzed how to share a reward among profit sharing agents. We have shown the necessary and sufficient condition to preserve the rationality condition in multi-agent reinforcement learning systems. If we use this theorem, we can enjoy several beneficial aspects of reward sharing, including improvement of learning speeds and qualities, without falling into the least desirable situation in which the expected reward per action is zero.

Our future projects include: 1) analyzing the improvement effects on learning speeds and qualities, 2) extending the result to other fields of reinforcement learning, and 3) finding efficient real-world applications.

Appendix

A Proof of Lemma 1

[Fig. 11. The conflict structures used in the proof: a (b=1, c=1), b (b=2, c=1), c (b=2, c=2), d (b=2, c=2), e (b=3, c=1), f (b=3, c=2), g (b=3, c=3).]

Reinforcement of an irrational rule makes it difficult to preserve the rationality condition under any σ. Therefore, the difficulty of a conflict structure is monotonic in the number of reinforcements of its irrational rules.

We enumerate the conflict structures according to the branching factor b (the number of state transitions from the same sensory input), the conflict factor c (the number of conflicting rules in it), and the number of reinforcements of irrational rules. Though we set L = 1 here, the argument extends easily to any L.

b = 1: This case is clearly not difficult, since there is no conflict (Fig. 11a).

b = 2: When there is no conflict (Fig. 11b), it is the same as b = 1. We divide the structures with c = 2 into two subcases: one contains a self-loop (Fig. 11c) and the other does not (Fig. 11d). In the case of Figure 11c, the self-loop rule may be selected repeatedly, while the non-self-loop rule is selected at most once. Therefore, if the self-loop rule is irrational, it will be reinforced more than the irrational rule of Figure 11d.

b ≥ 3: When there is no conflict (Fig. 11e), it is the same as b = 1. Consider the structure with c = 2 (Fig. 11f). Although the most difficult case is the one in which the conflicting structure has an irrational rule with a self-loop, even such a structure is less difficult than Figure 11c. Considering the structure with c = 3 (Fig. 11g), two of the conflicting rules are irrational because L = 1. Therefore, the expected number of reinforcements of each irrational rule is less than that of Figure 11f. Similarly, conflict structures with b > 3 are less difficult than Figure 11c.

From the above discussion, it is concluded that the most difficult conflict structure is that of Figure 11c. Q.E.D.

B Proof of Lemma 2

[Fig. 12. The most difficult multi-agent structure.]

For any reinforcement interval k (k = 0, 1, ..., W_D) in some goal-agent set, we show that there is a j (j = 1, 2, ..., L) satisfying the following condition:

    S^k_{ij} > S^k_i,   (5)

where S^k_{ij} is the weight of the j-th rational rule (r^k_{ij}) of agent i and S^k_i is the weight of its irrational rule (r^k_i) (Fig. 12).

First, we consider the ratio of the number of selections of r^k_{ij} to that of r^k_i. When the L rational rules of each agent are selected by all agents in turn, the minimum of this ratio is maximized (Table 2). In this case, the following ratio holds (Fig. 13):

    r^k_{ij} : r^k_i = 1 : (n − 1) L.   (6)

[Table 2. An example of a rule sequence for all the agents at some k. If "X changes to O1" or "O2 changes to O1" in this table, the learning in the agent that can select the changing rule occurs more easily.]

[Fig. 13. An example of rule sequences. Sequence 1 is more easily learned than sequence 2.]

Second, we consider the weights given to r^k_{ij} and r^k_i. When the agent that gets the direct reward senses no similar sensory input within W_D, the weight given to r^k_{ij} is minimized; it is R / M^{W_D}, attained at k = W_D. On the other hand, when an agent that gets the indirect reward senses the same sensory input at every step within W_I, the weight given to r^k_i is maximized; it is σR (1 − (1/M)^{W_I}) / (M − 1), since W_I ≥ W_D. Therefore, in order to satisfy condition (5), it is necessary that

    R / M^{W_D} > σR ( (1 − (1/M)^{W_I}) / (M − 1) ) (n − 1) L,   (7)

that is,

    σ < (M − 1) / ( M^{W_D} (1 − (1/M)^{W_I}) (n − 1) L ).   (8)

It is clear that this is also the sufficient condition. Q.E.D.
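As a numerical sanity check of this derivation (a sketch with arbitrary example values; the variable names follow the notation of this appendix), the maximal indirect weight is the geometric sum σR (1/M + ... + (1/M)^{W_I}), and the bound (8) is exactly the value of σ at which inequality (7) becomes an equality:

    # Numerical check of Appendix B with arbitrary example values.
    M, W_D, W_I, n, L = 2, 3, 4, 3, 1
    R = 1.0

    # Maximal total weight given to the irrational rule of one indirect-reward agent
    # (geometric sum) equals the closed form used in the text.
    geometric = sum((1.0 / M)**t for t in range(1, W_I + 1))
    closed_form = (1.0 - (1.0 / M)**W_I) / (M - 1)
    assert abs(geometric - closed_form) < 1e-12

    # At sigma equal to the bound of Eq. (8), the two sides of Eq. (7) coincide.
    sigma = (M - 1) / (M**W_D * (1.0 - (1.0 / M)**W_I) * (n - 1) * L)
    lhs = R / M**W_D
    rhs = sigma * R * closed_form * (n - 1) * L
    assert abs(lhs - rhs) < 1e-12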

References

1. Arai, S., Miyazaki, K., and Kobayashi, S.: Generating Cooperative Behavior by Multi-Agent Reinforcement Learning, Proc. of the 6th European Workshop on Learning Robots (1997).
2. Arai, S., Miyazaki, K., and Kobayashi, S.: Cranes Control Using Multi-agent Reinforcement Learning, International Conference on Intelligent Autonomous Systems 5 (1998).
3. Grefenstette, J. J.: Credit Assignment in Rule Discovery Systems Based on Genetic Algorithms, Machine Learning, Vol. 3 (1988).
4. Holland, J. H.: Escaping Brittleness: The Possibilities of General-Purpose Learning Algorithms Applied to Parallel Rule-Based Systems, in R. S. Michalski et al. (eds.), Machine Learning: An Artificial Intelligence Approach, Vol. 2, Morgan Kaufmann (1986).
5. Miyazaki, K., Yamamura, M., and Kobayashi, S.: On the Rationality of Profit Sharing in Reinforcement Learning, Proc. of the 3rd International Conference on Fuzzy Logic, Neural Nets and Soft Computing, Iizuka, Japan (1994).
6. Miyazaki, K., and Kobayashi, S.: Learning Deterministic Policies in Partially Observable Markov Decision Processes, International Conference on Intelligent Autonomous Systems 5 (1998).
7. Ono, N., Ikeda, O., and Rahmani, A. T.: Synthesis of Herding and Specialized Behavior by Modular Q-learning Animats, Proc. of the ALIFE V Poster Presentations (1996).
8. Sen, S., and Sekaran, M.: Multiagent Coordination with Learning Classifier Systems, in Weiss, G. and Sen, S. (eds.), Adaption and Learning in Multi-agent Systems, Springer-Verlag, Berlin, Heidelberg (1995).
9. Tan, M.: Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents, Proc. of the 10th International Conference on Machine Learning, pp. 330-337 (1993).
10. Watkins, C. J. C. H., and Dayan, P.: Technical Note: Q-learning, Machine Learning, Vol. 8, pp. 55-68 (1992).
11. Weiss, G.: Learning to Coordinate Actions in Multi-Agent Systems, Proc. of the 13th International Joint Conference on Artificial Intelligence, pp. 311-316 (1993).
12. Whitehead, S. D., and Ballard, D. H.: Active Perception and Reinforcement Learning, Proc. of the 7th International Conference on Machine Learning, pp. 162-169 (1990).

This article was processed using the LaTeX macro package with the LLNCS style.


More information

RECURSION EQUATION FOR

RECURSION EQUATION FOR Math 46 Lecture 8 Infinite Horizon discounted reward problem From the last lecture: The value function of policy u for the infinite horizon problem with discount factor a and initial state i is W i, u

More information

Reinforcement Learning and Control

Reinforcement Learning and Control CS9 Lecture notes Andrew Ng Part XIII Reinforcement Learning and Control We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make

More information

A reinforcement learning scheme for a multi-agent card game with Monte Carlo state estimation

A reinforcement learning scheme for a multi-agent card game with Monte Carlo state estimation A reinforcement learning scheme for a multi-agent card game with Monte Carlo state estimation Hajime Fujita and Shin Ishii, Nara Institute of Science and Technology 8916 5 Takayama, Ikoma, 630 0192 JAPAN

More information

Shigetaka Fujita. Rokkodai, Nada, Kobe 657, Japan. Haruhiko Nishimura. Yashiro-cho, Kato-gun, Hyogo , Japan. Abstract

Shigetaka Fujita. Rokkodai, Nada, Kobe 657, Japan. Haruhiko Nishimura. Yashiro-cho, Kato-gun, Hyogo , Japan. Abstract KOBE-TH-94-07 HUIS-94-03 November 1994 An Evolutionary Approach to Associative Memory in Recurrent Neural Networks Shigetaka Fujita Graduate School of Science and Technology Kobe University Rokkodai, Nada,

More information

1 Introduction Duality transformations have provided a useful tool for investigating many theories both in the continuum and on the lattice. The term

1 Introduction Duality transformations have provided a useful tool for investigating many theories both in the continuum and on the lattice. The term SWAT/102 U(1) Lattice Gauge theory and its Dual P. K. Coyle a, I. G. Halliday b and P. Suranyi c a Racah Institute of Physics, Hebrew University of Jerusalem, Jerusalem 91904, Israel. b Department ofphysics,

More information

Q-learning. Tambet Matiisen

Q-learning. Tambet Matiisen Q-learning Tambet Matiisen (based on chapter 11.3 of online book Artificial Intelligence, foundations of computational agents by David Poole and Alan Mackworth) Stochastic gradient descent Experience

More information

Abstract. This paper discusses polynomial-time reductions from Hamiltonian Circuit (HC),

Abstract. This paper discusses polynomial-time reductions from Hamiltonian Circuit (HC), SAT-Variable Complexity of Hard Combinatorial Problems Kazuo Iwama and Shuichi Miyazaki Department of Computer Science and Communication Engineering Kyushu University, Hakozaki, Higashi-ku, Fukuoka 812,

More information

Final. Introduction to Artificial Intelligence. CS 188 Spring You have approximately 2 hours and 50 minutes.

Final. Introduction to Artificial Intelligence. CS 188 Spring You have approximately 2 hours and 50 minutes. CS 188 Spring 2014 Introduction to Artificial Intelligence Final You have approximately 2 hours and 50 minutes. The exam is closed book, closed notes except your two-page crib sheet. Mark your answers

More information

by the size of state/action space (and by the trial length which maybeproportional to this). Lin's oine Q() (993) creates an action-replay set of expe

by the size of state/action space (and by the trial length which maybeproportional to this). Lin's oine Q() (993) creates an action-replay set of expe Speeding up Q()-learning In Proceedings of the Tenth European Conference on Machine Learning (ECML'98) Marco Wiering and Jurgen Schmidhuber IDSIA, Corso Elvezia 36 CH-69-Lugano, Switzerland Abstract. Q()-learning

More information

The convergence limit of the temporal difference learning

The convergence limit of the temporal difference learning The convergence limit of the temporal difference learning Ryosuke Nomura the University of Tokyo September 3, 2013 1 Outline Reinforcement Learning Convergence limit Construction of the feature vector

More information

CS 4100 // artificial intelligence. Recap/midterm review!

CS 4100 // artificial intelligence. Recap/midterm review! CS 4100 // artificial intelligence instructor: byron wallace Recap/midterm review! Attribution: many of these slides are modified versions of those distributed with the UC Berkeley CS188 materials Thanks

More information

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it

More information

Combine Monte Carlo with Exhaustive Search: Effective Variational Inference and Policy Gradient Reinforcement Learning

Combine Monte Carlo with Exhaustive Search: Effective Variational Inference and Policy Gradient Reinforcement Learning Combine Monte Carlo with Exhaustive Search: Effective Variational Inference and Policy Gradient Reinforcement Learning Michalis K. Titsias Department of Informatics Athens University of Economics and Business

More information

Q-Learning for Markov Decision Processes*

Q-Learning for Markov Decision Processes* McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan khoa.phan@mail.mcgill.ca Sandeep Manjanna sandeep.manjanna@mail.mcgill.ca (*Based on: Convergence of

More information