Rationality of Reward Sharing in Multi-agent Reinforcement Learning

Kazuteru Miyazaki and Shigenobu Kobayashi

Tokyo Institute of Technology, Department of Computational Intelligence and Systems Science, Interdisciplinary Graduate School of Science and Engineering, 4259 Nagatsuta, Midori-ku, Yokohama 226-8502 JAPAN

Abstract. In multi-agent reinforcement learning systems, it is important to share a reward among all agents. We focus on the Rationality Theorem of Profit Sharing [5] and analyze how to share a reward among all profit sharing agents. When an agent gets a direct reward R (R > 0), an indirect reward δR (δ ≥ 0) is given to the other agents. We have derived the following necessary and sufficient condition to preserve rationality:

  δ < (M-1) / (M^W0 (1 - (1/M)^WT) (n-1) L),

where M and L are the maximum numbers of conflicting rules and of conflicting rational rules in the same sensory input, W0 and WT are the maximum episode lengths of a direct-reward and an indirect-reward agent, and n is the number of agents. This theorem is derived by avoiding the least desirable situation, in which the expected reward per action is zero. Therefore, if we use this theorem, we can enjoy several beneficial aspects of reward sharing. Through numerical examples, we confirm the effectiveness of this theorem.

1 Introduction

Achieving cooperation in multi-agent systems is a highly desirable goal. In recent years, the bottom-up approach to multi-agent systems, which contrasts sharply with the top-down approach of DAI (Distributed Artificial Intelligence), has prevailed. In particular, approaches that realize cooperation through reinforcement learning are notable. There is much literature [2, 9, 10, 8, 7, 11, 12] on multi-agent reinforcement learning; Q-learning [10] and the Classifier System [4] are used in [2, 9, 7] and [11, 8], respectively. Previous works [1, 2] compare Profit Sharing [3] with Q-learning in the pursuit problem [1] and the cranes control problem [2]. These papers claim that Profit Sharing is suitable for multi-agent reinforcement learning systems.
In multi-agent environments where there is no negative reward, it is important to share a reward among all agents. Conventional work has used ad hoc sharing schemes. Though reward sharing may improve learning speed and quality, it can also damage system behavior. In particular, it is important to preserve the rationality condition, that is, that the expected reward per action is larger than zero.

In this paper, we aim to preserve the rationality condition in multi-agent environments where there is no negative reward. We focus on the Rationality Theorem of Profit Sharing [5] and analyze how to share a reward among all profit sharing agents. We show the necessary and sufficient condition to preserve the rationality condition in multi-agent profit sharing systems. Using this theorem, we can enjoy several beneficial aspects of reward sharing without falling into the least desirable situation, in which the expected reward per action is zero.

Section 2 describes the problem, the method and notations. Section 3 presents the necessary and sufficient condition to preserve the rationality condition in multi-agent profit sharing systems. Section 4 gives numerical examples that illustrate the theorem. Section 5 concludes.

2 The Domain

2.1 Problem Formulation

Consider n (n > 1) agents in an unknown environment. At each discrete time step, agent i (i = 1, 2, ..., n) is selected from the n agents based on the selection probabilities P_i (P_i > 0, Σ_i P_i = 1); the selected agent senses the environment and performs an action. An agent senses a set of discrete attribute-value pairs and chooses among M discrete actions. We denote agent i's sensory inputs as x_i, y_i, ... and its actions as a_i, b_i, .... A pair of a sensory input and an action is called a rule; we denote the rule `if x then a' as x→a. The function that maps sensory inputs to actions is called a policy. We call a policy rational if and only if its expected reward per action is larger than zero. The policy that maximizes the expected reward per action is called an optimal policy.

When the n0-th agent (0 < n0 ≤ n) gets a special sensory input on condition that (n0 - 1) agents already have special sensory inputs at that time step, the n0-th agent gets a direct reward R (R > 0) and the other (n - 1) agents get an indirect reward δR (δ ≥ 0).
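The turn-taking scheme above is easy to sketch in code; the following is a minimal illustration (the function name and the example probabilities are ours, not the paper's):

```python
import random

def select_agent(probs, rng):
    """Pick an agent index i with probability P_i (P_i > 0, sum P_i = 1)."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1  # guard against floating-point round-off

# Hypothetical example: three agents, agent 0 selected most of the time.
rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(10000):
    counts[select_agent([0.9, 0.05, 0.05], rng)] += 1
# counts[0] is close to 9000, i.e. agent 0 acts roughly 90% of the steps.
```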
We call the n0-th agent the direct-reward agent and the other (n - 1) agents indirect-reward agents. We do not have any information about n0 or the special sensory inputs. Furthermore, nobody (including the reward designer) knows whether the (n0 - 1) agents other than the n0-th agent are important or not. A set of n0 agents that is necessary for getting a direct reward is called a goal-agent set. In order to preserve the rationality condition, that is, an expected reward per action larger than zero, all agents in a goal-agent set must learn a rational policy.

We show an example of direct and indirect rewards in a pursuit problem (Fig. 1). There are six hunter agents (H0, H1, ..., H5) and one prey agent (Prey). When H0 moves down (Fig. 1a) and four hunter agents surround the prey agent as shown in figure 1b, the direct reward is given to H0 and the indirect reward is given to the other agents (H1, H2, ..., H5). The number of agents that the indirect reward is given to is 5 (= n - 1) because we have no information about n0 (= 4) and
the special sensory input. Traditionally, in this case, the direct reward would be given to H0, H1, H2 and H3. That presumes we know which agents are important for catching the prey. However, we cannot always have such information (for example, H5 may be important). Therefore, our formulation is more general than those used previously.

Fig. 1. The pursuit problem used to explain direct and indirect rewards.

In general, we could consider multi-agent environments in which several agents sense the environment and perform their actions asynchronously. In that case, several further problems arise, for example, how to set the reward function and how to decide the priority of actions. Therefore, we adopt the more basic environment in which one agent senses the environment and performs its action at each discrete time step. Furthermore, we use neither negative rewards nor reward values. Though such rewards are given to learning agents in most reinforcement learning systems, it is difficult to decide appropriate reward values and negative rewards. Therefore, we regard a reward as a good signal only.

2.2 Profit Sharing in Multi-agent Environments

The purpose of this paper is to guarantee the rationality of Profit Sharing (PS) [3] in the multi-agent environments discussed above. When a reward is given to an agent, PS reinforces the rules on an episode, that is, the sequence of rules selected between rewards, all at once. In multi-agent environments, an episode is interpreted by each agent. For example, when the agents select the rule sequence (x1→a1, x2→a2, y2→a2, z3→a3, x2→a2, y1→a1, y2→b2, z2→b2, x3→b3) (Fig. 2) and get the special sensory inputs (for getting a reward), it contains the episodes (x1→a1, y1→a1), (x2→a2, y2→a2, x2→a2, y2→b2, z2→b2) and (z3→a3, x3→b3) for agents 1, 2 and 3, respectively (Fig. 3). In this case, agent 3 gets a direct reward and the other agents get an indirect reward.
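The per-agent interpretation of an episode can be sketched as follows; a minimal, hypothetical implementation (all function names are ours) that splits the joint rule sequence of Fig. 2 by agent and then applies a geometric profit-sharing update of the kind described in this section:

```python
from collections import defaultdict

def split_episodes(joint_sequence):
    """Split a joint rule sequence into one episode per agent.

    joint_sequence: list of (agent_id, sensory_input, action) triples,
    ordered by time; returns {agent_id: [(input, action), ...]}.
    """
    episodes = defaultdict(list)
    for agent, x, a in joint_sequence:
        episodes[agent].append((x, a))
    return dict(episodes)

def reinforce(weights, episode, reward, M):
    """Profit-sharing update with the geometric function f_n = f_{n-1}/M:
    the rule selected n steps before the reward receives reward / M**n."""
    for n, rule in enumerate(reversed(episode), start=1):
        weights[rule] = weights.get(rule, 0.0) + reward / M**n

# The rule sequence of Fig. 2: agent 3 ends the episode and gets the
# direct reward R; agents 1 and 2 get the indirect reward delta * R.
seq = [(1, 'x1', 'a1'), (2, 'x2', 'a2'), (2, 'y2', 'a2'), (3, 'z3', 'a3'),
       (2, 'x2', 'a2'), (1, 'y1', 'a1'), (2, 'y2', 'b2'), (2, 'z2', 'b2'),
       (3, 'x3', 'b3')]
eps = split_episodes(seq)
weights = {}
M, R, delta = 2, 1.0, 0.05            # example values (ours)
reinforce(weights, eps[3], R, M)          # direct-reward agent
for i in (1, 2):
    reinforce(weights, eps[i], delta * R, M)  # indirect-reward agents
# e.g. weights[('y1','a1')] == delta*R/M and weights[('x1','a1')] == delta*R/M**2
```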
We call a subsequence of an episode a detour when the sensory input of the first rule and the sensory input resulting from the last rule are the same, though the two rules are different. For example, the episode (x2→a2, y2→a2, x2→a2,
y2→b2, z2→b2) of agent 2 contains the detour (y2→a2, x2→a2) (Fig. 3). A rule that always exists on a detour does not contribute to getting a direct reward. We call a rule ineffective if and only if it always exists on a detour; otherwise, the rule is called effective. For example, in the detour (y2→a2, x2→a2), y2→a2 is an ineffective rule and x2→a2 is an effective rule, because x2→a2 is not on a detour when it appears at the first sensory input of the episode (x2→a2, y2→a2, x2→a2, y2→b2, z2→b2).

Fig. 2. A rule sequence when the agents select actions in some order.

Fig. 3. Three episodes and one detour in figure 2.

We call a function that shares a reward among the rules on an episode a reinforcement function. The term f_i denotes the reinforcement value for the rule selected i steps before the reward is acquired. The weight S_r of each rule r_i on an episode (r_Wa, ..., r_i, ..., r_1) is reinforced by S_r_i = S_r_i + f_i, where W_a is the length of the episode, called the reinforcement interval. When a reward f_0 is given to the agent, we use the following reinforcement function, which satisfies the Rationality
Theorem of PS [5]:

  f_n = (1/M) f_{n-1},  n = 1, 2, ..., W_a,  (1)

where (f_0, W_a) is (R, W0) for the direct-reward agent and (δR, WT) (WT ≥ W0) for the indirect-reward agents. For example, in figure 3, the weights of y1→a1 and x1→a1 are reinforced by S_{y1a1} = S_{y1a1} + (1/M) δR and S_{x1a1} = S_{x1a1} + (1/M)^2 δR, respectively.

2.3 Properties of the Target Environments

The paper [2] claims that two main problems should be considered in multi-agent reinforcement learning systems. One is the perceptual aliasing problem [12], which is due to the agents' sensory limitations. The other is the uncertainty of state transitions, which is due to concurrent learning among the agents [8, 11]. We can treat both problems through two confusions [6].

We call the indistinction of state values a type 1 confusion. Figure 4a is an example of the type 1 confusion. In this example, the state value v is the minimum number of steps to a reward. The values of states a and b are 2 and 8, respectively. Though a and b are different states, the agent senses them as the same sensory input (state 1). If the agent visits states a and b equally often, the value of state 1 becomes 5 (= (2 + 8)/2). Therefore the value of state 1 is higher than the value of state 4 (which is 4). If the agent uses state values, it will prefer to move left in state 1; however, the agent should move right in state 1. This means that the agent learns an irrational policy in which it only transits between state b and state 1.

Fig. 4. a) An example of the type 1 confusion. b) An example of the type 2 confusion.

We call the indistinction of rational and irrational rules a type 2 confusion. Figure 4b is an example of the type 2 confusion. Though the action to move up in state a is irrational, it is rational in state b. Since the agent senses states a and b as the same sensory input (state 1), the action to move up in state 1 is
regarded as rational. If the agent learns the action to move right in state S, it takes an irrational policy that only transits between state a and state 2.

Fig. 5. Three classes of multi-agent environments.

In general, if there is a type 2 confusion at some sensory input, there is also a type 1 confusion at it. By these confusions, we can classify multi-agent environments into three classes, as shown in figure 5. Markov Decision Processes (MDPs), which are treated by many reinforcement learning systems, belong to class 1. Q-learning (QL) [10], which guarantees the acquisition of an optimal policy in MDPs, is deceived by the type 1 confusion, since it uses state values to make a policy. PS is not deceived by this confusion, since it does not use state values. On the other hand, reinforcement learning systems that use weights (including QL and PS) are deceived by the type 2 confusion. The Rationality Theorem of PS [5] guarantees the acquisition of a rational policy in the class with no type 2 confusion. Since we use that theorem, we must assume this class. Remark that this class still contains part of the two main problems (perceptual aliasing and uncertainty of state transitions) of multi-agent reinforcement learning. For example, an environment whose positive state transition probabilities never fall to zero has no type 2 confusion, even if the perceptual aliasing and the uncertainty of state transition problems are present. Therefore, our target environments are meaningful as multi-agent environments. In the next section, we extend the Rationality Theorem of PS to these multi-agent environments.

3 Rationality Theorem in Multi-agent Reinforcement Learning

3.1 The Basic Idea

In this section, we derive the necessary and sufficient condition to preserve the rationality condition in the multi-agent profit sharing systems discussed in the previous section.
We call the effective rules that will be learned under the Rationality Theorem of PS rational rules, and the others irrational rules. We show the relationship
Fig. 6. The relationship between rational, irrational, effective and ineffective rules.

of these rules in figure 6. Irrational rules should not be reinforced when they conflict with rational rules. When δ > 0, some irrational rule might be judged to be an effective rule. Therefore, it is important to suppress all irrational rules among the effective rules.

In order to preserve the rationality condition, all irrational rules in a goal-agent set must be suppressed. Conversely, if a goal-agent set is constructed from agents in which all irrational rules have been suppressed, the rationality condition is preserved. Therefore, we derive the necessary and sufficient condition on the range of δ that suppresses all irrational rules in some goal-agent set. First, we characterize the conflict structure in which it is most difficult to suppress irrational rules; for two conflict structures A and B, we say A is more difficult than B when the range of δ that can suppress every irrational rule of A is included in that of B. Second, we derive a necessary and sufficient condition on the range of δ that suppresses every irrational rule of the most difficult conflict structure. Last, the condition is extended to arbitrary conflict structures.

3.2 Proposal of the Rationality Theorem in Multi-agent Reinforcement Learning

Lemma 1 (The most difficult conflict structure). The most difficult conflict structure has only one irrational rule, and that rule has a self-loop.

The proof is shown in Appendix A. Figure 7 shows the most difficult conflict structure, in which the single self-looping irrational rule conflicts with L rational rules.

Lemma 2 (Suppressing the single self-looping irrational rule). The only irrational rule, with a self-loop, in some goal-agent set can be suppressed if and only if

  δ < (M-1) / (M^W0 (1 - (1/M)^WT) (n-1) L),  (2)

where M is the maximum number of conflicting rules in the same sensory input, L is the maximum number of conflicting rational rules, W0 is the maximum episode length of a direct-reward agent, WT is the reinforcement interval
Fig. 7. The most difficult conflict structure: one self-looping irrational rule conflicting with L rational rules.

of the indirect-reward agents, and n is the number of agents. The proof is shown in Appendix B.

By the law of transitivity, the following theorem is directly derived from these lemmas.

Theorem 1 (Rationality theorem in multi-agent reinforcement learning). Every irrational rule in some goal-agent set can be suppressed if and only if

  δ < (M-1) / (M^W0 (1 - (1/M)^WT) (n-1) L),  (3)

where M is the maximum number of conflicting rules in the same sensory input, L is the maximum number of conflicting rational rules, W0 is the maximum episode length of a direct-reward agent, WT is the reinforcement interval of the indirect-reward agents, and n is the number of agents.

3.3 The Meaning of Theorem 1

Theorem 1 is derived by avoiding the least desirable situation, in which the expected reward per action is zero. Therefore, if we use this theorem, we can enjoy multiple beneficial aspects of indirect rewards, including improvement of learning speed and quality.

We cannot know the number L in general; in practice, we can set L = M - 1. We cannot know the number WT in general; in practice, we can set δ = 0 whenever the length of an episode exceeds WT. If we set L = M - 1 and WT = W0 = W, theorem 1 simplifies to

  δ < 1 / ((M^W - 1)(n - 1)).  (4)
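The bound of Theorem 1 is easy to evaluate; the sketch below (helper names are ours) encodes our reading of the theorem and of equation (4), and checks that the two expressions coincide when L = M - 1 and WT = W0 = W:

```python
def sharing_bound(M, L, W0, WT, n):
    """Upper bound on the indirect-reward rate delta from Theorem 1:
    delta < (M - 1) / (M**W0 * (1 - (1/M)**WT) * (n - 1) * L)."""
    return (M - 1) / (M**W0 * (1 - (1.0 / M)**WT) * (n - 1) * L)

def simplified_bound(M, W, n):
    """Equation (4): the practical bound with L = M - 1 and WT = W0 = W."""
    return 1.0 / ((M**W - 1) * (n - 1))

# Hypothetical example values (ours): two actions, episodes of length 3,
# three agents. The two forms agree, since
# M**W * (1 - M**-W) == M**W - 1 and the (M - 1) factors cancel.
M, W, n = 2, 3, 3
assert abs(sharing_bound(M, M - 1, W, W, n) - simplified_bound(M, W, n)) < 1e-12
print(simplified_bound(M, W, n))  # 1/14, roughly 0.0714
```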
Fig. 8. Roulette-like environments used in the numerical examples.

4 Numerical Examples

4.1 Environments

Consider the roulette-like environments in figure 8. There are 3 and 4 learning agents in roulettes a) and b), respectively. The initial state of agent i (A_i) is S_i. The number shown in the center of the roulette is given to each agent as a sensory input. Each agent has two actions: move right (20% failure) or move left (50% failure). If an action fails, the agent cannot move. There is no situation where another agent gets the same sensory input. At each discrete time step, A_i is selected based on the selection probabilities P_i (P_i > 0, Σ_i P_i = 1); (P0, P1, P2) is (0.9, 0.05, 0.05) for roulette a), and (P0, ..., P4) is (0.72, 0.04, 0.04, ..., 0.2) for roulette b). When A_i reaches its goal G_i, P_i is set to 0 and the P_j (j ≠ i) are rescaled proportionally. When A_R reaches G_R on condition that all the other A_i (i ≠ R) have already reached their G_i, the direct reward R (= 1.0) is given to A_R and the indirect reward δR is given to the other agents. When some agent gets the direct reward, or some A_i reaches a G_j (j ≠ i), all agents return to the initial states shown in figure 8. All rules start from the same initial weight.

If every agent learns the policy `move right for any sensory input', the joint behavior is optimal. If at least two agents learn the policy `move left in the initial state', or `move right in the initial state and move left on the right side of the initial state', the behavior is irrational. When the optimal policy, once acquired, is not destroyed in the remaining episodes, the learning is judged to be successful. We stop the learning if agents 0, 1 and 2 learn the policy `move left in the initial state', or if the total number of actions grows too large. Initially, we set WT = 3; if the length of an episode exceeds WT, we set δ = 0 (see section 3.3). From equation (4), we set δ < 0.0714... for roulette a) to preserve the rationality condition; the bound for roulette b) is correspondingly smaller.
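The movement rule of these environments can be sketched as follows; a hypothetical step function (the ring topology and the number of positions are our assumptions, suggested but not fixed by the figure), with the stated failure rates of 20% for moving right and 50% for moving left:

```python
import random

def step(pos, action, n_pos, rng):
    """One move on a ring of n_pos sensory inputs.

    action: +1 = move right (fails with probability 0.2),
            -1 = move left  (fails with probability 0.5).
    A failed action leaves the agent where it is.
    """
    fail = 0.2 if action == +1 else 0.5  # failure rates from the text
    if rng.random() < fail:
        return pos
    return (pos + action) % n_pos        # ring topology is our assumption

rng = random.Random(1)
pos = 0
for _ in range(5):
    pos = step(pos, +1, 9, rng)  # five attempts to move right
```

The asymmetric failure rates are what make `always move right' the optimal policy: both actions eventually reach a goal, but moving right succeeds more often per attempt.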
4.2 Results

We investigate the learning quality and speed in both roulettes. The results are shown in Table 1; Tables 1a and 1b correspond to roulettes a) and b), respectively.

Table 1. The learning qualities and speeds in roulettes a) and b) for each δ: the numbers of irrational and optimal policies acquired, and the average (Ave.) and standard deviation (S.D.) of the learning speeds.

Fig. 9. Details of the learning speeds in roulettes a) and b), with the upper bound of Theorem 1 marked.

The learning quality is evaluated by the number of times irrational or optimal policies are acquired in a thousand trials with different random seeds. The learning speed is evaluated by the total number of actions needed to learn a thousand optimal
policies. Figures 9a and 9b show the details of the learning speeds in roulettes a) and b), respectively.

Though theorem 1 guarantees rationality, it does not guarantee optimality. Nevertheless, in both roulettes, the optimal policy was always learned even somewhat beyond the range of theorem 1. In roulette a), δ = 0.1 makes the learning speed the best (Tab. 1a, Fig. 9a). On the other hand, if we set δ ≥ 0.4, there are cases in which irrational policies are learned. For example, consider the case where A0, A1 and A2 in roulette a) obtain the three rule sequences shown in figure 10. In this case, with such a large δ, A0, A1 and A2 approach G2, G0 and G1, respectively.

Fig. 10. An example of rule sequences in roulette a).

If we set δ < 0.0714..., such irrational policies are not learned, and the indirect reward still improves the learning speed. Though it is possible to improve the learning speed beyond the range of theorem 1, we should respect the theorem in order to guarantee rationality in all environments.
In roulette b), A3 cannot learn anything because there is no G3. Therefore, if we set δ = 0, the optimal policy is never learned (Tab. 1b); in this case, we must use the indirect reward. Table 1b and figure 9b show that δ = 0.02 makes the learning speed the best. On the other hand, if δ is set too large, there are again cases in which irrational policies are learned. It is an important property of the indirect reward that the learning quality can exceed that of the case δ = 0. Though theorem 1 only guarantees rationality, the numerical examples show that it is also possible to improve learning speed and quality.

5 Conclusions

In most multi-agent reinforcement learning systems, reinforcement learning methods designed for single-agent systems are used. Though it is important to share a reward among all agents in multi-agent reinforcement learning systems, conventional work has used ad hoc sharing schemes.

In this paper, we focused on the Rationality Theorem of Profit Sharing and analyzed how to share a reward among all profit sharing agents. We showed the necessary and sufficient condition to preserve the rationality condition in multi-agent reinforcement learning systems. Using this theorem, we can enjoy multiple beneficial aspects of reward sharing, including improvement of learning speed and quality, without the least desirable situation in which the expected reward per action is zero.

Our future projects include: 1) analyzing the improvement effects on learning speed and quality, 2) extending the result to other fields of reinforcement learning, and 3) finding efficient real-world applications.

Appendix

A Proof of Lemma 1

Fig. 11. Conflict structures used in the proof: a (b=1, c=1), b (b=2, c=1), c (b=2, c=2), d (b=2, c=2), e (b=3, c=1), f (b=3, c=2), g (b=3, c=3).

Reinforcement of an irrational rule makes it difficult to preserve the rationality condition under any δ. Therefore, the difficulty of a conflict structure is monotonic in the number of reinforcements of its irrational rules. We enumerate conflict
structures according to the branching factor b (the number of state transitions from the same sensory input), the conflict factor c (the number of conflicting rules there), and the count of reinforcements of irrational rules. Though we set L = 1 here, the argument extends easily to any L.

b = 1: This is clearly not difficult, since there are no conflicts (Fig. 11a).

b = 2: When there are no conflicts (Fig. 11b), it is the same as b = 1. We divide the structures with c = 2 into two subcases: one contains a self-loop (Fig. 11c) and the other does not (Fig. 11d). In the case of figure 11c, the self-loop rule may be selected repeatedly, while the non-self-loop rule is selected at most once. Therefore, if the self-loop rule is irrational, it is reinforced more than the irrational rule of figure 11d.

b = 3: When there are no conflicts (Fig. 11e), it is the same as b = 1. Consider the structure with c = 2 (Fig. 11f). The most difficult case is again the one whose irrational rule is a self-loop, but even that structure is less difficult than figure 11c. In the structure with c = 3 (Fig. 11g), two conflicting rules are irrational because L = 1; therefore the expected number of reinforcements of each irrational rule is less than in figure 11f. Similarly, conflict structures with b > 3 are less difficult than figure 11c.

From the above discussion, we conclude that the most difficult conflict structure is that of figure 11c. Q.E.D.

B Proof of Lemma 2

Fig. 12. The most difficult multi-agent structure: in each agent i, the L rational rules r_i1^k, ..., r_iL^k conflict with the irrational rule r_i0^k.

For any reinforcement interval k (k = 0, 1, ..., W0) in some goal-agent set, we show that there is a j (j = 1, 2, ..., L) satisfying the condition

  S_ij^k > S_i0^k,  (5)

where S_ij^k is the weight of the j-th rational rule r_ij^k of agent i and S_i0^k is the weight of the conflicting irrational rule r_i0^k (Fig. 12). First, we consider the ratio of the number of selections of r_ij^k to that of r_i0^k. When n0 = n and the L rational rules of each agent are selected by all agents in turn, the minimum of this ratio is maximized (Tab. 2).
In this case, the following ratio holds (Fig. 13):

  r_ij^k : r_i0^k = 1 : (n - 1)L.  (6)
Table 2. An example of a rule sequence for all agents at some k. If a cross changes to a circle, or a double circle changes to a circle, in this table, learning in the agent that can select the changed rule occurs more easily.

Fig. 13. An example of rule sequences: sequence 1 is learned more easily than sequence 2.

Second, we consider the weights given to r_ij^k and r_i0^k. When the agent that gets the direct reward senses no identical sensory input within W0, the weight given to r_ij^k is minimized; it is R/M^W0, attained at k = W0. On the other hand, when the agents that get the indirect reward sense the same sensory input throughout WT, the weight given to r_i0^k is maximized; accumulated over WT (≥ W0), it is δR (1 - (1/M)^WT) / (M - 1). Therefore, condition (5) holds if and only if

  R / M^W0 > δR ((1 - (1/M)^WT) / (M - 1)) (n - 1) L,  (7)

that is,

  δ < (M-1) / (M^W0 (1 - (1/M)^WT) (n-1) L).  (8)
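The weight comparison behind condition (7) can be checked numerically; a sketch under our reconstruction of the formulas (all helper names are ours):

```python
def min_direct_weight(R, M, W0):
    """Smallest reinforcement a rational rule receives from the direct
    reward: R / M**W0, at the far end of the episode (k = W0)."""
    return R / M**W0

def max_indirect_weight(R, delta, M, WT):
    """Largest total an irrational rule can accumulate from one indirect
    reward: delta*R * sum_{i=1..WT} (1/M)**i
          = delta*R * (1 - (1/M)**WT) / (M - 1)  (geometric series)."""
    return delta * R * (1 - (1.0 / M)**WT) / (M - 1)

def rationality_preserved(R, delta, M, L, W0, WT, n):
    """Condition (7): the worst-case rational rule must still outweigh the
    worst-case irrational rule, selected (n-1)*L times as often."""
    return min_direct_weight(R, M, W0) > \
        max_indirect_weight(R, delta, M, WT) * (n - 1) * L

# Hypothetical example values (ours): the condition flips exactly at the
# bound of Theorem 1, as the derivation of (8) predicts.
R, M, L, W0, WT, n = 1.0, 2, 1, 3, 3, 3
bound = (M - 1) / (M**W0 * (1 - (1.0 / M)**WT) * (n - 1) * L)
assert rationality_preserved(R, 0.99 * bound, M, L, W0, WT, n)
assert not rationality_preserved(R, 1.01 * bound, M, L, W0, WT, n)
```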
This is clearly also the sufficient condition. Q.E.D.

References

1. Arai, S., Miyazaki, K., and Kobayashi, S.: Generating Cooperative Behavior by Multi-Agent Reinforcement Learning, Proc. of the 6th European Workshop on Learning Robots (1997).
2. Arai, S., Miyazaki, K., and Kobayashi, S.: Cranes Control Using Multi-agent Reinforcement Learning, International Conference on Intelligent Autonomous Systems 5 (1998).
3. Grefenstette, J. J.: Credit Assignment in Rule Discovery Systems Based on Genetic Algorithms, Machine Learning, Vol. 3, pp. 225-245 (1988).
4. Holland, J. H.: Escaping Brittleness: The Possibilities of General-Purpose Learning Algorithms Applied to Parallel Rule-Based Systems, in R. S. Michalski et al. (eds.), Machine Learning: An Artificial Intelligence Approach, Vol. 2, Morgan Kaufmann (1986).
5. Miyazaki, K., Yamamura, M., and Kobayashi, S.: On the Rationality of Profit Sharing in Reinforcement Learning, Proc. of the 3rd International Conference on Fuzzy Logic, Neural Nets and Soft Computing, Iizuka, Japan (1994).
6. Miyazaki, K., and Kobayashi, S.: Learning Deterministic Policies in Partially Observable Markov Decision Processes, International Conference on Intelligent Autonomous Systems 5 (1998).
7. Ono, N., Ikeda, O., and Rahmani, A. T.: Synthesis of Herding and Specialized Behavior by Modular Q-learning Animats, Proc. of the ALIFE V Poster Presentations (1996).
8. Sen, S., and Sekaran, M.: Multiagent Coordination with Learning Classifier Systems, in Weiss, G., and Sen, S. (eds.), Adaption and Learning in Multi-agent Systems, Springer-Verlag, Berlin, Heidelberg, pp. 218-233 (1995).
9. Tan, M.: Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents, Proc. of the 10th International Conference on Machine Learning, pp. 330-337 (1993).
10. Watkins, C. J. C. H., and Dayan, P.: Technical Note: Q-learning, Machine Learning, Vol. 8, pp. 279-292 (1992).
11. Weiss, G.: Learning to Coordinate Actions in Multi-Agent Systems, Proc. of the 13th International Joint Conference on Artificial Intelligence, pp. 311-316 (1993).
12. Whitehead, S. D., and Ballard, D. H.: Active Perception and Reinforcement Learning, Proc. of the 7th International Conference on Machine Learning, pp. 162-169 (1990).

This article was processed using the LaTeX macro package with LLNCS style.
More informationMarkov Decision Processes
Markov Decision Processes Noel Welsh 11 November 2010 Noel Welsh () Markov Decision Processes 11 November 2010 1 / 30 Annoucements Applicant visitor day seeks robot demonstrators for exciting half hour
More informationShort Course: Multiagent Systems. Multiagent Systems. Lecture 1: Basics Agents Environments. Reinforcement Learning. This course is about:
Short Course: Multiagent Systems Lecture 1: Basics Agents Environments Reinforcement Learning Multiagent Systems This course is about: Agents: Sensing, reasoning, acting Multiagent Systems: Distributed
More informationThe size of decision table can be understood in terms of both cardinality of A, denoted by card (A), and the number of equivalence classes of IND (A),
Attribute Set Decomposition of Decision Tables Dominik Slezak Warsaw University Banacha 2, 02-097 Warsaw Phone: +48 (22) 658-34-49 Fax: +48 (22) 658-34-48 Email: slezak@alfa.mimuw.edu.pl ABSTRACT: Approach
More informationLocal Linear Controllers. DP-based algorithms will find optimal policies. optimality and computational expense, and the gradient
Efficient Non-Linear Control by Combining Q-learning with Local Linear Controllers Hajime Kimura Λ Tokyo Institute of Technology gen@fe.dis.titech.ac.jp Shigenobu Kobayashi Tokyo Institute of Technology
More informationWojciech Penczek. Polish Academy of Sciences, Warsaw, Poland. and. Institute of Informatics, Siedlce, Poland.
A local approach to modal logic for multi-agent systems? Wojciech Penczek 1 Institute of Computer Science Polish Academy of Sciences, Warsaw, Poland and 2 Akademia Podlaska Institute of Informatics, Siedlce,
More informationCS343 Artificial Intelligence
CS343 Artificial Intelligence Prof: Department of Computer Science The University of Texas at Austin Good Afternoon, Colleagues Good Afternoon, Colleagues Are there any questions? Logistics Problems with
More informationImproving Coordination via Emergent Communication in Cooperative Multiagent Systems: A Genetic Network Programming Approach
Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics San Antonio, TX, USA - October 2009 Improving Coordination via Emergent Communication in Cooperative Multiagent Systems:
More informationBasics of reinforcement learning
Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system
More informationEvolving Behavioral Strategies. in Predators and Prey. Thomas Haynes and Sandip Sen. Department of Mathematical & Computer Sciences
Evolving Behavioral Strategies in Predators and Prey Thomas Haynes and Sandip Sen Department of Mathematical & Computer Sciences The University of Tulsa e{mail: [haynes,sandip]@euler.mcs.utulsa.edu? Abstract.
More informationThis question has three parts, each of which can be answered concisely, but be prepared to explain and justify your concise answer.
This question has three parts, each of which can be answered concisely, but be prepared to explain and justify your concise answer. 1. Suppose you have a policy and its action-value function, q, then you
More informationMARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti
1 MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti Historical background 2 Original motivation: animal learning Early
More informationA Neuro-Fuzzy Scheme for Integrated Input Fuzzy Set Selection and Optimal Fuzzy Rule Generation for Classification
A Neuro-Fuzzy Scheme for Integrated Input Fuzzy Set Selection and Optimal Fuzzy Rule Generation for Classification Santanu Sen 1 and Tandra Pal 2 1 Tejas Networks India Ltd., Bangalore - 560078, India
More informationComparison of Evolutionary Methods for. Smoother Evolution. 2-4 Hikaridai, Seika-cho. Soraku-gun, Kyoto , JAPAN
Comparison of Evolutionary Methods for Smoother Evolution Tomofumi Hikage, Hitoshi Hemmi, and Katsunori Shimohara NTT Communication Science Laboratories 2-4 Hikaridai, Seika-cho Soraku-gun, Kyoto 619-0237,
More informationQUICR-learning for Multi-Agent Coordination
QUICR-learning for Multi-Agent Coordination Adrian K. Agogino UCSC, NASA Ames Research Center Mailstop 269-3 Moffett Field, CA 94035 adrian@email.arc.nasa.gov Kagan Tumer NASA Ames Research Center Mailstop
More informationPerceptron. (c) Marcin Sydow. Summary. Perceptron
Topics covered by this lecture: Neuron and its properties Mathematical model of neuron: as a classier ' Learning Rule (Delta Rule) Neuron Human neural system has been a natural source of inspiration for
More informationMarkov Models and Reinforcement Learning. Stephen G. Ware CSCI 4525 / 5525
Markov Models and Reinforcement Learning Stephen G. Ware CSCI 4525 / 5525 Camera Vacuum World (CVW) 2 discrete rooms with cameras that detect dirt. A mobile robot with a vacuum. The goal is to ensure both
More informationPlanning in Markov Decision Processes
Carnegie Mellon School of Computer Science Deep Reinforcement Learning and Control Planning in Markov Decision Processes Lecture 3, CMU 10703 Katerina Fragkiadaki Markov Decision Process (MDP) A Markov
More informationA Preference Semantics. for Ground Nonmonotonic Modal Logics. logics, a family of nonmonotonic modal logics obtained by means of a
A Preference Semantics for Ground Nonmonotonic Modal Logics Daniele Nardi and Riccardo Rosati Dipartimento di Informatica e Sistemistica, Universita di Roma \La Sapienza", Via Salaria 113, I-00198 Roma,
More informationReinforcement Learning II
Reinforcement Learning II Andrea Bonarini Artificial Intelligence and Robotics Lab Department of Electronics and Information Politecnico di Milano E-mail: bonarini@elet.polimi.it URL:http://www.dei.polimi.it/people/bonarini
More informationAn Adaptive Clustering Method for Model-free Reinforcement Learning
An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at
More informationAbstract: Q-learning as well as other learning paradigms depend strongly on the representation
Ecient Q-Learning by Division of Labor Michael Herrmann RIKEN, Lab. f. Information Representation 2-1 Hirosawa, Wakoshi 351-01, Japan michael@sponge.riken.go.jp Ralf Der Universitat Leipzig, Inst. f. Informatik
More information0 o 1 i B C D 0/1 0/ /1
A Comparison of Dominance Mechanisms and Simple Mutation on Non-Stationary Problems Jonathan Lewis,? Emma Hart, Graeme Ritchie Department of Articial Intelligence, University of Edinburgh, Edinburgh EH
More informationReinforcement Learning
1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision
More informationReinforcement Learning and Deep Reinforcement Learning
Reinforcement Learning and Deep Reinforcement Learning Ashis Kumer Biswas, Ph.D. ashis.biswas@ucdenver.edu Deep Learning November 5, 2018 1 / 64 Outlines 1 Principles of Reinforcement Learning 2 The Q
More informationLearning Without State-Estimation in Partially Observable. Tommi Jaakkola. Department of Brain and Cognitive Sciences (E10)
Learning Without State-Estimation in Partially Observable Markovian Decision Processes Satinder P. Singh singh@psyche.mit.edu Tommi Jaakkola tommi@psyche.mit.edu Michael I. Jordan jordan@psyche.mit.edu
More informationCoarticulation in Markov Decision Processes
Coarticulation in Markov Decision Processes Khashayar Rohanimanesh Department of Computer Science University of Massachusetts Amherst, MA 01003 khash@cs.umass.edu Sridhar Mahadevan Department of Computer
More informationMultiagent Systems Motivation. Multiagent Systems Terminology Basics Shapley value Representation. 10.
Multiagent Systems July 2, 2014 10. Coalition Formation Multiagent Systems 10. Coalition Formation B. Nebel, C. Becker-Asano, S. Wöl Albert-Ludwigs-Universität Freiburg July 2, 2014 10.1 Motivation 10.2
More informationA Residual Gradient Fuzzy Reinforcement Learning Algorithm for Differential Games
International Journal of Fuzzy Systems manuscript (will be inserted by the editor) A Residual Gradient Fuzzy Reinforcement Learning Algorithm for Differential Games Mostafa D Awheda Howard M Schwartz Received:
More informationΦ 12 Φ 21 Φ 22 Φ L2 Φ LL Φ L1
Self-organiing Symbol Acquisition and Motion Generation based on Dynamics-based Information Processing System Masafumi Okada, Daisuke Nakamura 2 and Yoshihiko Nakamura 2 Dept. of Mechanical Science and
More informationDual Memory Model for Using Pre-Existing Knowledge in Reinforcement Learning Tasks
Dual Memory Model for Using Pre-Existing Knowledge in Reinforcement Learning Tasks Kary Främling Helsinki University of Technology, PL 55, FI-25 TKK, Finland Kary.Framling@hut.fi Abstract. Reinforcement
More informationCoarticulation in Markov Decision Processes Khashayar Rohanimanesh Robert Platt Sridhar Mahadevan Roderic Grupen CMPSCI Technical Report 04-33
Coarticulation in Markov Decision Processes Khashayar Rohanimanesh Robert Platt Sridhar Mahadevan Roderic Grupen CMPSCI Technical Report 04- June, 2004 Department of Computer Science University of Massachusetts
More informationZero-Sum Stochastic Games An algorithmic review
Zero-Sum Stochastic Games An algorithmic review Emmanuel Hyon LIP6/Paris Nanterre with N Yemele and L Perrotin Rosario November 2017 Final Meeting Dygame Dygame Project Amstic Outline 1 Introduction Static
More informationξ, ξ ξ Data number ξ 1
Polynomial Design of Dynamics-based Information Processing System Masafumi OKADA and Yoshihiko NAKAMURA Univ. of Tokyo, 7-- Hongo Bunkyo-ku, JAPAN Abstract. For the development of the intelligent robot
More informationLearning Communication for Multi-agent Systems
Learning Communication for Multi-agent Systems C. Lee Giles 1,2 and Kam-Chuen Jim 2 1 School of Information Sciences & Technology and Computer Science and Engineering The Pennsylvania State University
More informationPlanning by Probabilistic Inference
Planning by Probabilistic Inference Hagai Attias Microsoft Research 1 Microsoft Way Redmond, WA 98052 Abstract This paper presents and demonstrates a new approach to the problem of planning under uncertainty.
More informationA Gentle Introduction to Reinforcement Learning
A Gentle Introduction to Reinforcement Learning Alexander Jung 2018 1 Introduction and Motivation Consider the cleaning robot Rumba which has to clean the office room B329. In order to keep things simple,
More informationStatistical Analysis of Backtracking on. Inconsistent CSPs? Irina Rish and Daniel Frost.
Statistical Analysis of Backtracking on Inconsistent CSPs Irina Rish and Daniel Frost Department of Information and Computer Science University of California, Irvine, CA 92697-3425 firinar,frostg@ics.uci.edu
More informationCS599 Lecture 1 Introduction To RL
CS599 Lecture 1 Introduction To RL Reinforcement Learning Introduction Learning from rewards Policies Value Functions Rewards Models of the Environment Exploitation vs. Exploration Dynamic Programming
More informationCourse 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016
Course 16:198:520: Introduction To Artificial Intelligence Lecture 13 Decision Making Abdeslam Boularias Wednesday, December 7, 2016 1 / 45 Overview We consider probabilistic temporal models where the
More informationCS 7180: Behavioral Modeling and Decisionmaking
CS 7180: Behavioral Modeling and Decisionmaking in AI Markov Decision Processes for Complex Decisionmaking Prof. Amy Sliva October 17, 2012 Decisions are nondeterministic In many situations, behavior and
More informationLearning Delayed Influence of Dynamical Systems From Interpretation Transition
Learning Delayed Influence of Dynamical Systems From Interpretation Transition Tony Ribeiro 1, Morgan Magnin 2,3, and Katsumi Inoue 1,2 1 The Graduate University for Advanced Studies (Sokendai), 2-1-2
More informationA Cooperation Strategy for Decision-making Agents
A Cooperation Strategy for Decision-making Agents Eduardo Camponogara Institute for Complex Engineered Systems Carnegie Mellon University Pittsburgh, PA 523-389 camponog+@cs.cmu.edu Abstract The style
More informationLearning Conditional Probabilities from Incomplete Data: An Experimental Comparison Marco Ramoni Knowledge Media Institute Paola Sebastiani Statistics
Learning Conditional Probabilities from Incomplete Data: An Experimental Comparison Marco Ramoni Knowledge Media Institute Paola Sebastiani Statistics Department Abstract This paper compares three methods
More informationMarkov decision processes
CS 2740 Knowledge representation Lecture 24 Markov decision processes Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Administrative announcements Final exam: Monday, December 8, 2008 In-class Only
More informationDecision Theory: Markov Decision Processes
Decision Theory: Markov Decision Processes CPSC 322 Lecture 33 March 31, 2006 Textbook 12.5 Decision Theory: Markov Decision Processes CPSC 322 Lecture 33, Slide 1 Lecture Overview Recap Rewards and Policies
More informationProf. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be
REINFORCEMENT LEARNING AN INTRODUCTION Prof. Dr. Ann Nowé Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING WHAT IS IT? What is it? Learning from interaction Learning about, from, and while
More informationExponential Moving Average Based Multiagent Reinforcement Learning Algorithms
Artificial Intelligence Review manuscript No. (will be inserted by the editor) Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Mostafa D. Awheda Howard M. Schwartz Received:
More informationGrundlagen der Künstlichen Intelligenz
Grundlagen der Künstlichen Intelligenz Formal models of interaction Daniel Hennes 27.11.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Taxonomy of domains Models of
More informationMultiagent (Deep) Reinforcement Learning
Multiagent (Deep) Reinforcement Learning MARTIN PILÁT (MARTIN.PILAT@MFF.CUNI.CZ) Reinforcement learning The agent needs to learn to perform tasks in environment No prior knowledge about the effects of
More informationBest-Response Multiagent Learning in Non-Stationary Environments
Best-Response Multiagent Learning in Non-Stationary Environments Michael Weinberg Jeffrey S. Rosenschein School of Engineering and Computer Science Heew University Jerusalem, Israel fmwmw,jeffg@cs.huji.ac.il
More informationDynamics in the dynamic walk of a quadruped robot. Hiroshi Kimura. University of Tohoku. Aramaki Aoba, Aoba-ku, Sendai 980, Japan
Dynamics in the dynamic walk of a quadruped robot Hiroshi Kimura Department of Mechanical Engineering II University of Tohoku Aramaki Aoba, Aoba-ku, Sendai 980, Japan Isao Shimoyama and Hirofumi Miura
More informationLearning to Coordinate Efficiently: A Model-based Approach
Journal of Artificial Intelligence Research 19 (2003) 11-23 Submitted 10/02; published 7/03 Learning to Coordinate Efficiently: A Model-based Approach Ronen I. Brafman Computer Science Department Ben-Gurion
More informationPrioritized Sweeping Converges to the Optimal Value Function
Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science
More informationThe Reinforcement Learning Problem
The Reinforcement Learning Problem Slides based on the book Reinforcement Learning by Sutton and Barto Formalizing Reinforcement Learning Formally, the agent and environment interact at each of a sequence
More informationDempster's Rule of Combination is. #P -complete. Pekka Orponen. Department of Computer Science, University of Helsinki
Dempster's Rule of Combination is #P -complete Pekka Orponen Department of Computer Science, University of Helsinki eollisuuskatu 23, SF{00510 Helsinki, Finland Abstract We consider the complexity of combining
More informationCOMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning. Hanna Kurniawati
COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning Hanna Kurniawati Today } What is machine learning? } Where is it used? } Types of machine learning
More informationLecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan
COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan Some slides borrowed from Peter Bodik and David Silver Course progress Learning
More information15-780: Graduate Artificial Intelligence. Reinforcement learning (RL)
15-780: Graduate Artificial Intelligence Reinforcement learning (RL) From MDPs to RL We still use the same Markov model with rewards and actions But there are a few differences: 1. We do not assume we
More informationLaboratory for Computer Science. Abstract. We examine the problem of nding a good expert from a sequence of experts. Each expert
Picking the Best Expert from a Sequence Ruth Bergman Laboratory for Articial Intelligence Massachusetts Institute of Technology Cambridge, MA 0239 Ronald L. Rivest y Laboratory for Computer Science Massachusetts
More informationCS 4649/7649 Robot Intelligence: Planning
CS 4649/7649 Robot Intelligence: Planning Probability Primer Sungmoon Joo School of Interactive Computing College of Computing Georgia Institute of Technology S. Joo (sungmoon.joo@cc.gatech.edu) 1 *Slides
More informationOn Prediction and Planning in Partially Observable Markov Decision Processes with Large Observation Sets
On Prediction and Planning in Partially Observable Markov Decision Processes with Large Observation Sets Pablo Samuel Castro pcastr@cs.mcgill.ca McGill University Joint work with: Doina Precup and Prakash
More informationThe Markov Decision Process Extraction Network
The Markov Decision Process Extraction Network Siegmund Duell 1,2, Alexander Hans 1,3, and Steffen Udluft 1 1- Siemens AG, Corporate Research and Technologies, Learning Systems, Otto-Hahn-Ring 6, D-81739
More informationGeorey J. Gordon. Carnegie Mellon University. Pittsburgh PA Bellman-Ford single-destination shortest paths algorithm
Stable Function Approximation in Dynamic Programming Georey J. Gordon Computer Science Department Carnegie Mellon University Pittsburgh PA 53 ggordon@cs.cmu.edu Abstract The success of reinforcement learning
More informationMachine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?
Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity
More informationRECURSION EQUATION FOR
Math 46 Lecture 8 Infinite Horizon discounted reward problem From the last lecture: The value function of policy u for the infinite horizon problem with discount factor a and initial state i is W i, u
More informationReinforcement Learning and Control
CS9 Lecture notes Andrew Ng Part XIII Reinforcement Learning and Control We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make
More informationA reinforcement learning scheme for a multi-agent card game with Monte Carlo state estimation
A reinforcement learning scheme for a multi-agent card game with Monte Carlo state estimation Hajime Fujita and Shin Ishii, Nara Institute of Science and Technology 8916 5 Takayama, Ikoma, 630 0192 JAPAN
More informationShigetaka Fujita. Rokkodai, Nada, Kobe 657, Japan. Haruhiko Nishimura. Yashiro-cho, Kato-gun, Hyogo , Japan. Abstract
KOBE-TH-94-07 HUIS-94-03 November 1994 An Evolutionary Approach to Associative Memory in Recurrent Neural Networks Shigetaka Fujita Graduate School of Science and Technology Kobe University Rokkodai, Nada,
More information1 Introduction Duality transformations have provided a useful tool for investigating many theories both in the continuum and on the lattice. The term
SWAT/102 U(1) Lattice Gauge theory and its Dual P. K. Coyle a, I. G. Halliday b and P. Suranyi c a Racah Institute of Physics, Hebrew University of Jerusalem, Jerusalem 91904, Israel. b Department ofphysics,
More informationQ-learning. Tambet Matiisen
Q-learning Tambet Matiisen (based on chapter 11.3 of online book Artificial Intelligence, foundations of computational agents by David Poole and Alan Mackworth) Stochastic gradient descent Experience
More informationAbstract. This paper discusses polynomial-time reductions from Hamiltonian Circuit (HC),
SAT-Variable Complexity of Hard Combinatorial Problems Kazuo Iwama and Shuichi Miyazaki Department of Computer Science and Communication Engineering Kyushu University, Hakozaki, Higashi-ku, Fukuoka 812,
More informationFinal. Introduction to Artificial Intelligence. CS 188 Spring You have approximately 2 hours and 50 minutes.
CS 188 Spring 2014 Introduction to Artificial Intelligence Final You have approximately 2 hours and 50 minutes. The exam is closed book, closed notes except your two-page crib sheet. Mark your answers
More informationby the size of state/action space (and by the trial length which maybeproportional to this). Lin's oine Q() (993) creates an action-replay set of expe
Speeding up Q()-learning In Proceedings of the Tenth European Conference on Machine Learning (ECML'98) Marco Wiering and Jurgen Schmidhuber IDSIA, Corso Elvezia 36 CH-69-Lugano, Switzerland Abstract. Q()-learning
More informationThe convergence limit of the temporal difference learning
The convergence limit of the temporal difference learning Ryosuke Nomura the University of Tokyo September 3, 2013 1 Outline Reinforcement Learning Convergence limit Construction of the feature vector
More informationCS 4100 // artificial intelligence. Recap/midterm review!
CS 4100 // artificial intelligence instructor: byron wallace Recap/midterm review! Attribution: many of these slides are modified versions of those distributed with the UC Berkeley CS188 materials Thanks
More informationReinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina
Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it
More informationCombine Monte Carlo with Exhaustive Search: Effective Variational Inference and Policy Gradient Reinforcement Learning
Combine Monte Carlo with Exhaustive Search: Effective Variational Inference and Policy Gradient Reinforcement Learning Michalis K. Titsias Department of Informatics Athens University of Economics and Business
More informationQ-Learning for Markov Decision Processes*
McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan khoa.phan@mail.mcgill.ca Sandeep Manjanna sandeep.manjanna@mail.mcgill.ca (*Based on: Convergence of
More information