A reinforcement learning scheme for a multi-agent card game with Monte Carlo state estimation

Hajime Fujita and Shin Ishii
Nara Institute of Science and Technology, Takayama, Ikoma, JAPAN
{hajime-f,ishii}@is.aist-nara.ac.jp
CREST, Japan Science and Technology Agency

Abstract

This article presents a state estimation method based on Monte Carlo sampling in a partially observable situation. We formulate an automatic strategy acquisition problem for the multi-agent card game Hearts as a reinforcement learning (RL) problem. Since there are often many unobservable cards in this game, RL is dealt with in the framework of a partially observable Markov decision process (POMDP). We apply a Monte Carlo method to estimate unobservable states. Simulation results show that our model-based POMDP-RL method with Monte Carlo state estimation is applicable to this realistic multi-agent problem.

1 Introduction

Studies on reinforcement learning (RL) schemes in multi-agent environments have so far focused on game-strategy acquisition problems, which usually assume completely observable situations; for example, Blackjack [5], Othello [9] and Backgammon [7] have been studied. In more realistic applications, however, there is partial observability, and hence partially observable Markov decision processes (POMDPs) have attracted much attention in recent years. POMDPs provide a way to choose optimal actions in partially observable situations. For example, Kaelbling et al. presented a method that obtains the optimal solution of a POMDP [4], but in general such a solution is difficult to achieve. One reason is that the estimation of missing information [2], which is crucial for solving a POMDP, is often intractable due to the large state space. An approximation method is therefore essential for solving a realistic POMDP problem.

In our previous study [3], we dealt with the card game Hearts (a four-player, non-cooperative, finite-state, zero-sum, imperfect-information game) and proposed an RL scheme that used the prediction of environmental behaviors and the estimation of unobservable state variables. We formulated the automatic strategy acquisition problem of this card game as a POMDP, such that the state transition was approximated as one defined on mean-field approximation states. Although the resulting RL agent grew stronger than rule-based agents, further improvement was possible; the mean-field approximation was effective but not very accurate, because the states of Hearts are discrete. It is well known that such an analog approximation of discrete states makes the value (the evaluation of a state) smaller than its optimum. One possible way to avoid this difficulty is to estimate the unobservable variables with a sampling method, such as a Monte Carlo method [8].

This article proposes a state estimation method based on Monte Carlo sampling in a realistic partially observable situation, namely the card game Hearts. Since the state space of this game is huge, we focus on pessimistic observations, each of which is predicted to be detrimental to the learning agent, in order to carve out an important domain from such a large space. We then use Monte Carlo sampling over such future observations to estimate hidden states and to calculate their likelihood based on the prediction of environmental behaviors. This shrinks the state space and makes the estimation of the real world more effective.
We carried out experiments using rule-based agents that are stronger than the previous ones [3]; a new rule-based agent has more than 50 rules and plays at the level of an experienced human Hearts player. The learning agent trained by our improved POMDP-RL method exhibits high performance against the rule-based agents, and is dramatically better than an agent trained by our previous method in both learning speed and strength. We therefore conclude that our model-based POMDP-RL method with Monte Carlo state estimation is applicable to a realistic multi-agent problem.

2 Preparation

2.1 Review of POMDPs

This section describes the essential features of POMDPs that are used throughout this article. A POMDP framework consists of (1) a set of real states S = {s_1, s_2, ..., s_{|S|}}, which cannot be determined with complete certainty; (2) a set of observation states O = {o_1, o_2, ..., o_{|O|}}, which can be perceived by the agents; (3) a set of actions A = {a_1, a_2, ..., a_{|A|}}, which can be executed by the agents; and (4) a reward function R : S × A → R, which maps a state-action pair to a numerical reward. The dynamics of this model are represented by the transition probability P(s_{t+1} | s_t, a_t) and the observation probability P(o_t | s_t). Since the agents cannot perceive real states, they maintain belief states, each of which is a probability distribution over S and is a sufficient statistic of the whole history H_t = {o_t, a_{t-1}, o_{t-1}, ..., a_1, o_1}. The belief state is updated according to an incremental Bayes rule:

b(s_{t+1}) = P(s_{t+1} | o_{t+1}, a_t, H_t)
           = \frac{P(o_{t+1} | s_{t+1}, a_t, H_t) P(s_{t+1} | a_t, H_t)}{P(o_{t+1} | a_t, H_t)}
           = \frac{P(o_{t+1} | s_{t+1}) \sum_{s_t} P(s_{t+1} | s_t, a_t) b(s_t)}{P(o_{t+1} | a_t, H_t)}.    (1)

Here, b(s_t) ≡ P(s_t | H_t) ∈ B represents a belief state about a real state s_t. The main problem in a POMDP is to make the agents learn a policy π : B → A which maximizes the expected numerical reward. A formulation as a belief-state MDP provides a way to achieve this, by calculating the value function for a belief state:

V^\pi(b_t) = \sum_{s_t} b(s_t) V^\pi(s_t).    (2)

Here, V^π(b_t) and V^π(s_t) represent the value function under a policy π for b(s_t) and s_t, respectively. Since V^π(b_t) is linear with respect to b_t ≡ b(s_t), the optimal V^*(b_t) is the upper surface of the collection of state value functions, and hence the optimal value function V^*(b_t) is piecewise-linear and convex.

In general, it is difficult to calculate equations (1) and (2) because the set of real states S is huge, so the summation over s is intractable and some approximation method is needed. In our previous study [3], we used a mean-field approximation, which calculated the expectation value of states to avoid this intractability. Although this approximation achieved a good performance, it introduced some error in evaluating the current optimal value: applying an analog approximation like the mean-field approximation to problems whose states are discrete, like the card game Hearts, causes a performance loss due to under-estimation of the value function. One possible way to resolve this difficulty is to use a sampling method, e.g., a Monte Carlo method, and to estimate the unobservable states using an ensemble of sampled discrete states. The details of this method are described below (see Proposed method).
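To make equations (1) and (2) concrete, the following is a minimal sketch of the belief update and the belief value for a generic finite POMDP, not for Hearts itself; the transition array T, observation array Z, state values V and the toy numbers are illustrative assumptions, not taken from the paper.

    import numpy as np

    def belief_update(b, a, o, T, Z):
        """Incremental Bayes update of equation (1).

        b : belief over states, shape (|S|,)
        a : index of the action taken at time t
        o : index of the observation received at time t+1
        T : transition probabilities, T[a, s, s'] = P(s' | s, a)
        Z : observation probabilities, Z[s', o] = P(o | s')
        """
        predicted = b @ T[a]                 # sum_s P(s' | s, a) b(s)
        unnormalized = Z[:, o] * predicted   # P(o_{t+1} | s_{t+1}) * predicted
        return unnormalized / unnormalized.sum()   # divide by P(o_{t+1} | a_t, H_t)

    def belief_value(b, V):
        """Equation (2): value of a belief state as the expectation of state values."""
        return float(b @ V)

    # Toy usage with made-up numbers (2 states, 2 actions, 2 observations).
    T = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.4, 0.6]]])   # T[a, s, s']
    Z = np.array([[0.8, 0.2], [0.3, 0.7]])     # Z[s', o]
    b = np.array([0.5, 0.5])
    b = belief_update(b, a=0, o=1, T=T, Z=Z)
    print(b, belief_value(b, V=np.array([1.0, -1.0])))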
2.2 POMDP formulation

The game of Hearts is partially observable for each of the four players. If we assume that there is a single learning agent and three opponent agents that do not learn, this game is approximated as a partially observable Markov decision process (POMDP); that is, the opponent agents constitute a static environment.

A single state transition of the game is represented by: (1) a real state s, which includes the allocation of every card (observable and unobservable); (2) an observation o for the learning agent, i.e., the cards in the agent's hand and the cards that have already been played in past time steps and in the current one; (3) the agent's action a, i.e., a single play at its turn; and (4) the strategy φ of each of the opponent agents. Let t indicate a playing turn of the learning agent; t = 14 indicates the end state of the game (there are 13 time steps in the card game Hearts; see section 4.1 for the details of the rules). In the following descriptions, we assume that there are three opponent agents intervening between the t-th play and the (t+1)-th play of the learning agent (if the first player of the t-th time step and of the (t+1)-th time step are different, the number of intervening opponent agents is not three, but the following explanation remains valid in such a case).
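To fix this notation in code before formalizing the transition, here is one possible, purely illustrative encoding of these elements; the card representation and field names are our own choices and are not prescribed by the paper.

    from dataclasses import dataclass
    from typing import FrozenSet, List, Tuple

    Card = Tuple[str, int]            # e.g. ('S', 12) = queen of spades; ranks 2..14 with A = 14

    @dataclass
    class RealState:
        """s: the full allocation of all 52 cards, unobservable to any single player."""
        hands: Tuple[FrozenSet[Card], FrozenSet[Card], FrozenSet[Card], FrozenSet[Card]]
        played: List[Tuple[int, Card]]        # (player, card) in the order played so far

    @dataclass
    class Observation:
        """o: what the learning agent perceives: its own hand plus all cards already played."""
        my_hand: FrozenSet[Card]
        played: List[Tuple[int, Card]]

    Action = Card                             # a: a single play at the agent's turn

    def observe(s: RealState, player: int) -> Observation:
        """The observation is a deterministic function of the real state (cf. Fact 2 below)."""
        return Observation(my_hand=s.hands[player], played=list(s.played))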

Between the t-th play and the (t+1)-th play of the learning agent there are therefore three state transitions, due to the actions of the three opponent agents. These intermediate state transitions are also indexed by t. Each of the opponent agents is likewise in a partial observation situation; its state, observation, action and strategy at its t-th playing turn are denoted by s_t^i, o_t^i, a_t^i and φ^i, respectively, where i is the index of the opponent agent. Let M denote the learning agent and M^i (i = 1, 2, 3) denote the i-th opponent agent. We assume that agent M^i probabilistically determines its action a_t^i from its own observation o_t^i at its t-th play. Under this assumption, the state transition between the t-th play and the (t+1)-th play of the learning agent is given by

P(s_{t+1} | s_t, a_t, \Phi) = \sum_{s_t^1, s_t^2, s_t^3} \sum_{a_t^1, a_t^2, a_t^3} \prod_{j=0}^{3} P(s_t^{j+1} | s_t^j, a_t^j) \prod_{i=1}^{3} \sum_{o_t^i} P(a_t^i | o_t^i, \phi^i) P(o_t^i | s_t^i),    (3)

where s_t^0 = s_t, a_t^0 = a_t, s_t^4 = s_{t+1}, a_t^4 = a_{t+1}, o_t^0 = o_t and o_t^4 = o_{t+1}. Φ ≡ {φ^i : i = 1, 2, 3}, where φ^i denotes the strategy of opponent agent M^i. Since the game process of Hearts is deterministic, the following three facts hold:

Fact 1. The new state s_t^{i+1}, which is reached from a previous state s_t^i by an action a_t^i, is uniquely determined. Namely, P(s_t^{i+1} | s_t^i, a_t^i) is 1 for a certain state and 0 for all other states.

Fact 2. The observation o_t^i is uniquely determined at state s_t^i. Namely, P(o_t^i | s_t^i) is 1 for a certain observation state and 0 for all other observation states.

Fact 3. The penalty points at time t are determined only by the actual observation o_{t+1}. Assuming that the reward is given by the penalty points received in a turn, the reward r_t therefore depends only on the observation o_{t+1}.

Since the state s_t is not observable for any agent, the learning agent could estimate it using the history H_t ≡ {(o_t, ·, ·), (o_{t-1}, a_{t-1}, a_{t-1}^{1,2,3}), ..., (o_1, a_1, a_1^{1,2,3})} and the actions a_t^i (i = 1, 2, 3) at the t-th play. To predict the rewards, however, it is not necessary to estimate s_t; it is enough to estimate o_{t+1}, because of Fact 3 above. Using this property, we propose a state estimation method that predicts the expected reward effectively in the next section.
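To illustrate Facts 1-3, the sketch below resolves a single time step deterministically and reads the penalty directly off the cards that will appear in o_{t+1}. It assumes the standard Hearts rules summarized in section 4.1 (the highest card of the suit led wins the trick), a (suit, rank) card encoding with ranks 2-14 (ace high), and helper names of our own choosing.

    PENALTY = {('S', 12): 13}   # the queen of spades costs 13 points; each heart costs 1

    def card_penalty(card):
        suit, rank = card
        if suit == 'H':
            return 1
        return PENALTY.get(card, 0)

    def resolve_trick(plays):
        """plays: list of (player, (suit, rank)) for one time step, in playing order.

        Fact 1: the outcome is a deterministic function of the state and the actions.
        Fact 3: the penalty (reward) depends only on the cards visible in o_{t+1}.
        Returns (winner, penalty_points)."""
        lead_suit = plays[0][1][0]
        # the strongest card of the suit led takes the trick
        winner, _ = max((p for p in plays if p[1][0] == lead_suit),
                        key=lambda p: p[1][1])
        points = sum(card_penalty(card) for _, card in plays)
        return winner, points

    # Example: the queen of spades is dumped on a spade lead.
    print(resolve_trick([(0, ('S', 5)), (1, ('S', 12)), (2, ('H', 3)), (3, ('S', 9))]))
    # -> (1, 14): player 1 wins the trick and receives 13 + 1 penalty points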
3 Proposed method

3.1 State estimation by Monte Carlo sampling

In the POMDP framework, we need to consider all possible combinations of unobservable states and to calculate the summation over the possible states (see equations (1) and (2)). However, this is computationally intractable in problems whose state space is huge. In addition, Hearts has a difficulty peculiar to realistic problems. Since an ordinary 52-card deck is used, each state is represented by a 52-dimensional vector. Although the 13 cards in the learning agent's hand are observable variables, the remaining unobservable cards occupy 39 dimensions of each state. Moreover, each player plays a card in turn, so the state space of the unobservable variables shrinks by three dimensions to a subspace at every time step. Therefore, if we estimate primitive states from such a high-dimensional and awkward state space with a sampling method, it does not perform well: sampled states in the shrunken state space have little meaning, and expected values calculated from such sampled states are not accurate.

According to our method, therefore, we first sample a next observation ô_{t+1} from a subspace Ô_{t+1}, which is extracted from the possible next observation space O_{t+1}, before estimating the current state s_t. The effective dimension of O_{t+1} is larger by three than that of O_t, so it is enough for the sampling method to focus on those three dimensions; this restriction on the observation space is helpful for restricting the possible state space. Next, we estimate unobservable states from the state subspace restricted by each sampled observation, and predict the environmental change starting from each of the estimated states.

More concretely, we consider a specific next observation space Ô_{t+1}, extracted from the possible next observation space O_{t+1}, such that every next observation ô_{t+1} ∈ Ô_{t+1} is predicted to be detrimental to the learning agent. Such an observation ô_{t+1} is called a pessimistic observation. The three cards played at time t are determined solely by the difference between each sampled observation ô_{t+1} and the current observation o_t, and the winner at time t is then also determined. Here, we assume for simplicity that ô_{t+1} includes the order of the cards played at time t. The state space S_t(ô_{t+1}), which is restricted by each sampled observation ô_{t+1}, is a subspace of the whole state space S_t; its dimension is smaller than that of S_t by almost three. By sampling an unobservable state from this restricted state space, the likelihood of the next-time state that produces a pessimistic observation ô_{t+1} can be calculated based on the dynamics of the environment. Note that each pessimistic observation is generated according to the opponent agents' strategies, assuming that they take actions detrimental to the learning agent (see Action control by state prediction, below). This idea for reducing the state space to explore is similar to the Min-Max search of a game decision tree.

First, then, the pessimistic observations are generated from the current history H_t based on a set of rules, which is easy to do because of Fact 3 above. Second, we sample possible current states ŝ_t(k, n) ∈ S_t(ô_{t+1}(k)) so as to be consistent with each generated pessimistic observation ô_{t+1}(k), where k is the index of pessimistic observations and n is the index of sampled current states. Let K denote the number of pessimistic observations ô_{t+1} and N the number of current states ŝ_t(k, n) sampled for each ô_{t+1}. Since there are many current states consistent with a pessimistic observation ô_{t+1}(k), we sample N current states ŝ_t(k, n) for each pessimistic observation ô_{t+1}(k). As just described, we can carve out an important domain of the state space by restricting the next observation space to the pessimistic observations, and estimate the unobservable state variables by Monte Carlo sampling from such a shrunken subspace. Using this state estimation method, the learning agent calculates the likelihood of each pessimistic observation ô_{t+1}(k), as described in the next section.
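The two-stage sampling just described can be summarized schematically as follows. The rule set that proposes pessimistic observations and the sampler of consistent card allocations depend on hand-crafted Hearts knowledge, so they appear here only as placeholder callables (generate_pessimistic_observations, sample_consistent_state); only the K × N loop structure is taken from the text.

    def estimate_states(history, o_t, K, N,
                        generate_pessimistic_observations,
                        sample_consistent_state):
        """Section 3.1, schematically.

        1) Carve out K pessimistic next observations ô_{t+1}(k) from the current
           history (rule-based; easy because of Fact 3).
        2) For each of them, draw N current states ŝ_t(k, n) from the restricted
           subspace S_t(ô_{t+1}(k)) that is consistent with it, with uniform
           probability, i.e. P(s_t = ŝ_t(k, n)) = 1/N.
        Returns a list of (pessimistic observation, [N sampled states]) pairs.
        """
        samples = []
        for o_hat in generate_pessimistic_observations(history, o_t, K):
            states = [sample_consistent_state(history, o_t, o_hat) for _ in range(N)]
            samples.append((o_hat, states))
        return samples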

3.2 State prediction

The transition probability for the observation state is given by

P(o_{t+1} | a_t, \Phi, H_t) = \sum_{s_t \in S_t} \sum_{s_{t+1} \in S_{t+1}} P(o_{t+1} | s_{t+1}) P(s_{t+1} | s_t, a_t, \Phi) P(s_t | H_t).    (4)

From Facts 1 and 2 above and equation (3), equation (4) is transformed into

P(o_{t+1} | a_t, \Phi, H_t) = \sum_{s_t \in S_t} \sum_{(a_t^1, a_t^2, a_t^3) \in A_{t+1}} \prod_{i=1}^{3} P(a_t^i | o_t^i, \phi^i, H_t^i) P(s_t | H_t),    (5)

where P(s_t | H_t) is a belief state. A_{t+1} denotes the set of possible triplets (a_t^1, a_t^2, a_t^3) by which the previous state-action pair (s_t, a_t) reaches a new state s_{t+1} whose observation state is o_{t+1}. Note that A_{t+1} corresponds to the difference between the next observation space O_{t+1} and the current observation space O_t. Equation (5) provides a model of the environmental dynamics. However, the calculation in equation (5) involves two difficulties. One is the intractability of the belief state: since the state space S_t of the game of Hearts is huge, the rigorous calculation of the summation over s_t ∈ S_t is difficult. The other is the difficulty of searching the game tree: since the next observation space O_{t+1} is huge, A_{t+1}, which is the difference between it and the current observation space O_t, is also huge, and the summation over (a_t^1, a_t^2, a_t^3) ∈ A_{t+1} is therefore also difficult.

To avoid these computational difficulties, we approximate equation (5) using the Monte Carlo sampling described in section 3.1 as

P(o_{t+1} | a_t, \Phi, H_t) \approx \sum_{\hat{s}_t \in S_t(\hat{o}_{t+1})} \sum_{(\hat{a}_t^1, \hat{a}_t^2, \hat{a}_t^3) \in A_{t+1}(\hat{o}_{t+1})} \prod_{i=1}^{3} P(\hat{a}_t^i | \hat{o}_t^i, \phi^i, H_t^i) P(\hat{s}_t | H_t)
                            = \frac{1}{N} \sum_{k=1}^{K} \sum_{n=1}^{N} \prod_{i=1}^{3} P(\hat{a}_t^i(k) | \hat{o}_t^i(k, n), \hat{\phi}^i, \hat{H}_t^i(k, n)).    (6)

Note that when the next observation space O_{t+1} is restricted to Ô_{t+1}, S_t and A_{t+1} are also restricted to S_t(ô_{t+1}) and A_{t+1}(ô_{t+1}), respectively, and the computational difficulties described above can be avoided. ô_t^i(k, n) denotes an observation of the opponent agent M^i. Because every card played at the t-th time step is contained in the next observation o_{t+1}, if we pick the k-th pessimistic observation, the next history Ĥ_{t+1}(k) = H_t ∪ {ô_{t+1}(k), a_t, â_t^1(k), â_t^2(k), â_t^3(k)} is determined without any ambiguity.

Moreover, if we consider the n-th sampled current state ŝ_t(k, n) for the k-th next history Ĥ_{t+1}(k), the observation state of each opponent agent, ô_t^i(k, n), can also be determined. â_t^i(k) denotes an action of the opponent agent M^i depending on the k-th pessimistic observation, and Ĥ_t^i(k, n) denotes an opponent agent's history depending on ô_t^i(k, n). φ̂^i represents the policy approximated by the learning agent, which is assumed to produce the actions of agent M^i but differs from the real strategy φ^i in equation (3) (recall that we have assumed agent M^i probabilistically determines its action a_t^i from its own observation o_t^i). The action selection probability of the i-th opponent agent, P(â_t^i | ô_t^i(k, n), φ̂^i, Ĥ_t^i(k, n)), is calculated by a function approximator whose input is ô_t^i(k, n). We assume P(s_t = ŝ_t(k, n)) = 1/N, that is, the current states are sampled from the restricted state space S_t with uniform probability. Under this assumption the sampling method does not use the incremental Bayes inference for belief state estimation, but neither does it need to maintain a belief state, which is a probability distribution over a huge unobservable state space. By calculating equation (6) approximately with this state estimation method, the learning agent can predict the observation dynamics and select its action based on the environmental behavior.

3.3 Action control by state prediction

According to our RL method, an action is selected based on the expected TD error, defined by

\delta_t(a_t) = \langle R(o_{t+1}) \rangle(a_t) + \gamma \langle V(o_{t+1}) \rangle(a_t) - V(o_t),    (7)

where

\langle f(o_{t+1}) \rangle(a_t) \equiv \sum_{o_{t+1} \in O_{t+1}} P(o_{t+1} | a_t, \Phi, H_t) f(o_{t+1})    (8)

and P(o_{t+1} | a_t, Φ, H_t) is given by equation (6). The expected TD error thus incorporates the estimation of unobservable states and the strategies of the other agents. Using the expected TD error, the action selection probability is determined as

P(a_t | o_t) = \frac{\exp(\delta_t(a_t)/T_m)}{\sum_{a_t' \in A} \exp(\delta_t(a_t')/T_m)}.    (9)

T_m is a parameter controlling the action randomness. Our RL method uses the TD error expected with respect to the estimated transition probability for the observation state, and an action is then determined based on this estimated environmental model. Such an RL method is often called a model-based RL method.
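A schematic rendering of equations (6)-(9): the per-sample likelihoods of equation (6) are averaged into observation probabilities, which then feed the expected TD error and the softmax of equation (9). Here opponent_views and predict_opponent_action_prob stand in for the reconstruction of the opponents' observations and histories and for the learned action predictors of section 3.4, V and reward are placeholder callables over observations, and normalizing the probabilities over the pessimistic subset only is our own simplification.

    import math

    def observation_probs(samples, opponent_views, predict_opponent_action_prob):
        """Equation (6): for each pessimistic observation ô_{t+1}(k), average over the N
        sampled states the product of the predicted opponent-action probabilities.

        samples: [(ô_{t+1}(k), [ŝ_t(k, n), ...]), ...] as produced in section 3.1
        opponent_views(o_hat, s_hat) -> [(ô_t^i, â_t^i, Ĥ_t^i) for i = 1, 2, 3]
        """
        probs = []
        for o_hat, states in samples:
            total = 0.0
            for s_hat in states:
                p = 1.0
                for o_i, a_i, h_i in opponent_views(o_hat, s_hat):
                    p *= predict_opponent_action_prob(a_i, o_i, h_i)
                total += p
            probs.append(total / len(states))
        return probs

    def expected_td_error(samples, probs, reward, V, o_t, gamma):
        """Equation (7): <R(o_{t+1})>(a_t) + gamma <V(o_{t+1})>(a_t) - V(o_t), with the
        expectation (8) taken over the estimated (here normalized) observation probabilities."""
        z = sum(probs) or 1.0
        expectation = sum(p * (reward(o_hat) + gamma * V(o_hat))
                          for p, (o_hat, _) in zip(probs, samples)) / z
        return expectation - V(o_t)

    def action_probs(deltas, T_m=1.0):
        """Equation (9): softmax over the expected TD errors of the candidate actions."""
        weights = [math.exp(d / T_m) for d in deltas]
        z = sum(weights)
        return [w / z for w in weights]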
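The action predictor's output distribution of equation (10) can be sketched as a softmax over merit values. U_i stands for the merit function (realized in the paper as an NGnet trained by on-line EM, which is not reproduced here), the argument names are ours, and the training comment is only schematic.

    import math

    def opponent_action_probs(U_i, o_i, legal_actions, T_i=1.0):
        """Equation (10): softmax over the merit values U^i(ô_t^i(k, n), â) of the legal
        actions A^i of opponent M^i.  U_i is any callable (observation, action) -> float."""
        weights = {a: math.exp(U_i(o_i, a) / T_i) for a in legal_actions}
        z = sum(weights.values())
        return {a: w / z for a, w in weights.items()}

    # Training (schematic): when a past game is replayed, o_t^i is reconstructed and the
    # merit U^i(o_t^i, a_t^i) of the action the opponent actually took is increased,
    # actor-style; the paper realizes U^i with an NGnet fitted by on-line EM [6].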

4 Experiment

4.1 The card game Hearts

Hearts is played by four players with an ordinary 52-card deck. There is an order of strength within each suit (decreasing from A, K, Q, ..., 2). The cards are distributed to the four players so that each player holds 13 cards at the beginning of a game. Thereafter, each player plays a card in clockwise order. The game ends after 13 time steps have been carried out. At each time step, the player that has played the strongest card (of the suit led) becomes the winner of that time step. Each heart incurs a one-point penalty, while the queen of spades incurs a 13-point penalty. The winner of a time step receives all of the penalty points of the cards played in that time step. A single game is played according to these rules, and at the end of the game the score of each player is the sum of the penalty points it has received. The lower the score, the better, so players must contrive strategies to avoid receiving penalty points.

4.2 Single agent learning in a stationary environment

We carried out experiments using one learning agent and three rule-based opponent agents. The newly developed rule-based agent has more than 50 rules and plays at the level of an experienced human Hearts player. The acquired penalty ratio was 0.42 when an agent that simply played permitted cards at random from its hand challenged the three rule-based agents. The acquired penalty ratio is the ratio of the penalty points acquired by the learning agent to the total penalty points of the four agents; that is, the random agent acquired on average about 2.1 times as many points as a rule-based agent. The rule-based agent used in this experiment is much stronger than the previous one [3]: the learning agent based on our previous RL method could not become stronger than this rule-based agent even after 120,000 training games (see Figure 2; detailed data not shown).
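As a quick check of the figures quoted above (assuming the three rule-based agents split the remaining penalty points roughly equally):

    random_agent_ratio = 0.42                                  # acquired penalty ratio of the random agent
    rule_based_ratio_each = (1.0 - random_agent_ratio) / 3     # ~0.193 per rule-based agent
    print(random_agent_ratio / rule_based_ratio_each)          # ~2.17, i.e. about 2.1-fold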
Figure 1 shows the learning curve of an agent trained by our new RL method when it challenged the three rule-based agents. The curve is an average over three learning runs, each consisting of 100,000 training games. The acquired penalty ratio of the learning agent decreases from the beginning of the learning episodes. This simulation result shows that the learning agent based on our proposed method performs well against experienced-level rule-based agents.

Figure 2 compares the learning agent trained by our improved RL method with a learning agent trained by our previous method. The learning curves are averages over three learning runs, each consisting of 80,000 training games, in which each agent challenged the three rule-based agents. The result shows that our new RL method improves dramatically on our previous one in both learning speed and strength. There are two reasons for this successful result. First, the learning agent can avoid the worst consequences by generating pessimistic observations, which also shrinks the state space effectively. Second, our state estimation based on the Monte Carlo method over discrete states can effectively eliminate the above-mentioned performance loss.

Figure 1: Proposed RL agent vs. rule-based agents (curves: "Proposed RL agent", "Rule-based agents"; axes: averaged acquired penalty ratio vs. number of games, x 10^4). The penalty ratio is smoothed using the 2,000 games just before that number of training games.

Figure 2: Comparison between the proposed RL agent and an agent trained by our previous RL method (curves: "Proposed RL agent with Monte Carlo state estimation", "Proposed RL agent with mean-field state estimation"; same axes as Figure 1). The penalty ratio is smoothed using the 2,000 games just before that number of training games.

5 Summary

In this article, we proposed a state estimation method based on Monte Carlo sampling in a partially observable world, and designed an autonomous learning agent that plays the multi-player card game Hearts. Since Hearts is a realistic imperfect-information game and its state space is huge, the RL scheme is formulated as a POMDP, and an approximation method is required to estimate the unobservable state variables. By using pessimistic observations, we carved out the important domain of the huge state space and estimated the transition probability from discrete states sampled according to those pessimistic observations. The proposed method thus removes the performance loss arising from analog approximation and avoids computational intractability, so that it copes with partial observability in a multi-agent environment. Our model-based POMDP-RL method with sampling-based estimation of unobservable states has successfully been applied to a realistic multi-agent problem.

References

[1] Barto, A.; Sutton, R.; and Anderson, C. 1983. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, Vol. 13.

[2] Ginsberg, M. 2001. Imperfect information in a computationally challenging game. Journal of Artificial Intelligence Research, Vol. 14.

[3] Fujita, H.; Matsuno, Y.; and Ishii, S. 2003. A reinforcement learning scheme for a multi-agent card game. In Proc. IEEE International Conference on Systems, Man and Cybernetics.

[4] Kaelbling, L. P.; Littman, M. L.; and Cassandra, A. R. 1998. Planning and acting in partially observable stochastic domains. Artificial Intelligence, Vol. 101.

[5] Pérez-Uribe, A. and Sanchez, A. Blackjack as a test bed for learning strategies in neural networks. In Proc. IEEE International Joint Conference on Neural Networks, Vol. 3.

[6] Sato, M. and Ishii, S. 2000. On-line EM algorithm for the normalized Gaussian network. Neural Computation, Vol. 12, No. 2.

[7] Tesauro, G. 1994. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, Vol. 6, No. 2.

[8] Thrun, S. 2000. Monte Carlo POMDPs. In Advances in Neural Information Processing Systems 12.

[9] Yoshioka, T.; Ishii, S.; and Ito, M. 1999. Strategy acquisition for the game Othello based on min-max reinforcement learning. IEICE Transactions on Information and Systems, Vol. E82-D, No. 12.


More information

Local Linear Controllers. DP-based algorithms will find optimal policies. optimality and computational expense, and the gradient

Local Linear Controllers. DP-based algorithms will find optimal policies. optimality and computational expense, and the gradient Efficient Non-Linear Control by Combining Q-learning with Local Linear Controllers Hajime Kimura Λ Tokyo Institute of Technology gen@fe.dis.titech.ac.jp Shigenobu Kobayashi Tokyo Institute of Technology

More information

Introduction to Reinforcement Learning. Part 6: Core Theory II: Bellman Equations and Dynamic Programming

Introduction to Reinforcement Learning. Part 6: Core Theory II: Bellman Equations and Dynamic Programming Introduction to Reinforcement Learning Part 6: Core Theory II: Bellman Equations and Dynamic Programming Bellman Equations Recursive relationships among values that can be used to compute values The tree

More information

CS 287: Advanced Robotics Fall Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study

CS 287: Advanced Robotics Fall Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study CS 287: Advanced Robotics Fall 2009 Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study Pieter Abbeel UC Berkeley EECS Assignment #1 Roll-out: nice example paper: X.

More information

MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti

MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti 1 MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti Historical background 2 Original motivation: animal learning Early

More information

STATE GENERALIZATION WITH SUPPORT VECTOR MACHINES IN REINFORCEMENT LEARNING. Ryo Goto, Toshihiro Matsui and Hiroshi Matsuo

STATE GENERALIZATION WITH SUPPORT VECTOR MACHINES IN REINFORCEMENT LEARNING. Ryo Goto, Toshihiro Matsui and Hiroshi Matsuo STATE GENERALIZATION WITH SUPPORT VECTOR MACHINES IN REINFORCEMENT LEARNING Ryo Goto, Toshihiro Matsui and Hiroshi Matsuo Department of Electrical and Computer Engineering, Nagoya Institute of Technology

More information

Reinforcement Learning. Machine Learning, Fall 2010

Reinforcement Learning. Machine Learning, Fall 2010 Reinforcement Learning Machine Learning, Fall 2010 1 Administrativia This week: finish RL, most likely start graphical models LA2: due on Thursday LA3: comes out on Thursday TA Office hours: Today 1:30-2:30

More information

An Introduction to Reinforcement Learning

An Introduction to Reinforcement Learning 1 / 58 An Introduction to Reinforcement Learning Lecture 01: Introduction Dr. Johannes A. Stork School of Computer Science and Communication KTH Royal Institute of Technology January 19, 2017 2 / 58 ../fig/reward-00.jpg

More information

Reinforcement Learning. Up until now we have been

Reinforcement Learning. Up until now we have been Reinforcement Learning Slides by Rich Sutton Mods by Dan Lizotte Refer to Reinforcement Learning: An Introduction by Sutton and Barto Alpaydin Chapter 16 Up until now we have been Supervised Learning Classifying,

More information

Planning with Predictive State Representations

Planning with Predictive State Representations Planning with Predictive State Representations Michael R. James University of Michigan mrjames@umich.edu Satinder Singh University of Michigan baveja@umich.edu Michael L. Littman Rutgers University mlittman@cs.rutgers.edu

More information

An Introduction to Reinforcement Learning

An Introduction to Reinforcement Learning An Introduction to Reinforcement Learning Shivaram Kalyanakrishnan shivaram@cse.iitb.ac.in Department of Computer Science and Engineering Indian Institute of Technology Bombay April 2018 What is Reinforcement

More information