Decayed Markov Chain Monte Carlo for Interactive POMDPs

Yanlin Han, Piotr Gmytrasiewicz
Department of Computer Science, University of Illinois at Chicago, Chicago, IL

Abstract

To act optimally in a partially observable, stochastic, multi-agent environment, an autonomous agent needs to maintain a belief about the world at any given time. An extension of partially observable Markov decision processes (POMDPs), called interactive POMDPs (I-POMDPs), provides a principled framework for planning and acting in such settings. I-POMDPs augment POMDP beliefs by including models of the other agents in the state space, forming a hierarchical belief structure that represents an agent's belief about the physical state, its belief about the other agents, and their beliefs about others' beliefs. This nested hierarchy results in a dramatic increase in belief space complexity. In order to perform belief updates in such settings, we propose a new approximation method that utilizes decayed Markov chain Monte Carlo (D-MCMC). For problems of various complexities, we show that our approach effectively mitigates the belief space complexity and competes with other Monte Carlo sampling algorithms for multi-agent systems, such as the interactive particle filter (I-PF). We also compare their accuracy and efficiency, and suggest applicable scenarios for each algorithm.

1 Introduction

Partially observable Markov decision processes (POMDPs) (Kaelbling, Littman, and Cassandra 1998) provide a principled, decision-theoretic framework for planning under uncertainty in a partially observable, stochastic environment. An autonomous agent operates rationally in such settings by maintaining a belief over the physical state at any given time and sequentially choosing the optimal actions that maximize future rewards. Solutions of POMDPs are therefore mappings from an agent's beliefs to actions. Although POMDPs can be used in multi-agent settings, they do so under the strong assumption that the effects of other agents' actions can be treated implicitly as noise and folded into the state transitions, as in recent Bayes-adaptive POMDPs (Ross, Chaib-draa, and Pineau 2007), the infinite regionalized policy representation (Liu, Liao, and Carin 2011), and infinite POMDPs (Doshi-Velez et al. 2013). Thus, an agent's beliefs about other agents are not part of POMDP solutions.

Interactive POMDPs (I-POMDPs) (Gmytrasiewicz and Doshi 2005) generalize POMDPs to multi-agent settings by replacing POMDP belief spaces with interactive hierarchical belief systems. Specifically, they augment the plain beliefs about physical states in a POMDP by including models of other agents in the state space, which forms a hierarchical belief structure. The models of other agents included in the augmented state space are of two types: intentional models, which ascribe beliefs, preferences, and rationality to other agents (Gmytrasiewicz and Doshi 2005), and subintentional models, such as finite state controllers (Panella and Gmytrasiewicz 2016), which do not. We focus on intentional models in this paper. Solutions of I-POMDPs map an agent's belief about the environment and the other agents' models to actions, making the framework applicable to agent, human, and mixed agent-human applications.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
It has been clearly shown (Gmytrasiewicz and Doshi 2005) that the added sophistication of modeling others as rational agents results in a higher value function, dominating the one obtained by simply treating others as noise. However, for I-POMDPs, the interactive belief augmentation results in a dramatic increase in belief space complexity, adding to the curse of dimensionality: the complexity of the belief representation is proportional to the dimensionality of the beliefs, which grows exponentially with the number of agent models as the nesting level increases. Since exact solutions to POMDPs are proven to be PSPACE-complete for finite horizons and undecidable for infinite horizons (Papadimitriou and Tsitsiklis 1987), the time complexity of general I-POMDPs, which may contain multiple POMDPs and I-POMDPs depending on the actual nesting level, is at least PSPACE-complete for finite horizons and undecidable for infinite horizons. Therefore, in order to apply I-POMDPs to more realistic settings, a good approximation algorithm for computing the nested interactive beliefs is crucial to the trade-off between solution quality and computation.

To address this issue, we propose methods that use Monte Carlo sampling to obtain approximate solutions to I-POMDPs. Specifically, we use decayed Markov chain Monte Carlo (D-MCMC) (Marthi et al. 2002) to concentrate sampling of beliefs on their recent past, and the interactive particle filter (I-PF) (Doshi and Gmytrasiewicz 2009) to descend the belief hierarchy and sample at each level. Applying these methods to I-POMDPs is nontrivial: it requires generalizing D-MCMC to multi-agent settings with guaranteed convergence, and improving I-PF for the infinite time horizon and for the modeled agent of interest. Our method significantly mitigates the belief space complexity of I-POMDPs and works effectively and efficiently on various multi-agent problem settings. Compared with other sampling methods for I-POMDPs such as I-PF, the generalized D-MCMC is competitive, non-divergent, and advantageous in certain application scenarios.

2 Background

2.1 POMDP

A partially observable Markov decision process (POMDP) (Kaelbling, Littman, and Cassandra 1998) is a general model for planning and acting in a single-agent, partially observable, stochastic domain. It is defined for a single agent i as:

$$POMDP_i = \langle S, A_i, \Omega_i, T_i, O_i, R_i \rangle \qquad (1)$$

where S is the set of states of the environment; A_i is the set of agent i's possible actions; \Omega_i is the set of agent i's possible observations; T_i : S \times A_i \times S \to [0, 1] is the state transition function; O_i : S \times A_i \times \Omega_i \to [0, 1] is the observation function; and R_i : S \times A_i \to \mathbb{R} is the reward function.

An agent's belief about the state can be represented as a probability distribution over S. The belief update is done using the following formula, where \alpha is a normalizing constant:

$$b'(s') = \alpha \, O(s', a, o') \sum_{s \in S} T(s, a, s') \, b(s) \qquad (2)$$

The optimal action, a, is then part of the set of optimal actions, OPT(b_i), for the belief state, defined as:

$$OPT(b_i) = \operatorname*{arg\,max}_{a_i \in A_i} \Big\{ \sum_{s \in S} b_i(s) R(s, a_i) + \gamma \sum_{o_i \in \Omega_i} Pr(o_i \mid a_i, b_i) \, U\big(SE(b_i, a_i, o_i)\big) \Big\} \qquad (3)$$

where SE(b_i, a_i, o_i) denotes the belief update of Eq. (2) and U is the value function.
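To make Eq. (2) concrete, here is a minimal sketch of the discrete belief update with the transition and observation models stored as arrays. The tiger-game numbers below are illustrative placeholders, not the paper's exact matrices.

```python
import numpy as np

def pomdp_belief_update(b, a, o, T, O):
    """Eq. (2): b'(s') = alpha * O[s', a, o] * sum_s T[s, a, s'] * b(s).

    b : (|S|,) prior belief; T : (|S|, |A|, |S|) transition probabilities;
    O : (|S|, |A|, |Omega|) observation probabilities.
    """
    b_next = O[:, a, o] * (b @ T[:, a, :])  # likelihood times predicted belief
    return b_next / b_next.sum()            # alpha: renormalize

# Toy single-agent tiger model (illustrative): states {TL, TR},
# actions {listen, open-left, open-right}, observations {GL, GR}.
T = np.zeros((2, 3, 2))
T[:, 0, :] = np.eye(2)           # listening leaves the tiger where it is
T[:, 1, :] = T[:, 2, :] = 0.5    # opening any door resets the tiger uniformly
O = np.full((2, 3, 2), 0.5)      # observations are uninformative after opening
O[0, 0, 0] = O[1, 0, 1] = 0.85   # growl heard correctly w.p. 0.85 when listening
O[0, 0, 1] = O[1, 0, 0] = 0.15

b = np.array([0.5, 0.5])
print(pomdp_belief_update(b, a=0, o=0, T=T, O=O))  # belief shifts toward TL
```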
2.2 Markov Chain Monte Carlo

The Markov chain Monte Carlo (MCMC) method (Gilks, Richardson, and Spiegelhalter 1996) is widely used to approximate probability distributions that cannot be computed directly. It generates samples from a posterior distribution \pi(x) over a state space by simulating a Markov chain p(x' \mid x) whose state space is that of x and whose stationary distribution is \pi(x). The samples drawn from p converge to the target distribution \pi as the number of samples goes to infinity. A Markov chain with the appropriate stationary distribution can be constructed using Gibbs sampling (Pearl 1988), which works on a target distribution \pi(x) with x = (x_1, ..., x_t): for a Markov chain p(x' \mid x), in each iteration we pick i \in \{1, ..., t\} and sample x_i from its conditional distribution \pi(x_i \mid x_j : j \neq i); the stationary distribution of p is then \pi.
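As a concrete illustration of this construction (our own sketch, not code from the paper), the following Gibbs sampler resamples a hidden binary Markov chain x_1, ..., x_t given noisy observations, choosing the index i uniformly; the conditional of each x_i depends only on its Markov blanket. D-MCMC, introduced below, changes only how i is chosen.

```python
import numpy as np

rng = np.random.default_rng(0)
stay, acc, t = 0.9, 0.8, 50   # P(x_k = x_{k-1}), P(y_k = x_k), chain length

def gibbs_sweep(y, n_iter=5000):
    """Plain Gibbs over a hidden binary chain given observations y."""
    x = rng.integers(0, 2, size=t)              # arbitrary initial trajectory
    trans = lambda a, b: stay if a == b else 1 - stay
    obs = lambda s, o: acc if s == o else 1 - acc
    for _ in range(n_iter):
        i = rng.integers(t)                     # uniform index choice
        w = np.ones(2)
        for v in (0, 1):                        # conditional given Markov blanket
            if i > 0:
                w[v] *= trans(x[i - 1], v)
            if i < t - 1:
                w[v] *= trans(v, x[i + 1])
            w[v] *= obs(v, y[i])
        x[i] = rng.random() < w[1] / w.sum()    # sample x_i from pi(x_i | x_{-i})
    return x

y = rng.integers(0, 2, size=t)                  # synthetic observations for the demo
print(gibbs_sweep(y))
```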
3 The Model

3.1 I-POMDP framework

An interactive POMDP of agent i, I-POMDP_i, is defined as:

$$I\text{-}POMDP_i = \langle IS_{i,l}, A, \Omega_i, T_i, O_i, R_i \rangle \qquad (4)$$

where IS_{i,l} is the set of interactive states of the environment, defined as IS_{i,l} = S \times M_{j,l-1} for l \geq 1; here S is the set of states, M_{j,l-1} is the set of possible models of agent j, and l is the strategy level. A specific class of models are the (l-1)th-level intentional models, \Theta_{j,l-1}, of agent j: \theta_{j,l-1} = \langle b_{j,l-1}, A, \Omega_j, T_j, O_j, R_j, OC_j \rangle, where b_{j,l-1} is agent j's belief nested to level l-1, b_{j,l-1} \in \Delta(IS_{j,l-1}), and OC_j is j's optimality criterion. The intentional model can be rewritten as \theta_{j,l-1} = \langle b_{j,l-1}, \hat{\theta}_j \rangle, where the frame \hat{\theta}_j collects all elements of the intentional model other than the belief. IS_{i,l} can be defined inductively (note that when the frame \hat{\theta}_j is known, the model \theta_j reduces to the belief b_j):

$$\begin{aligned} IS_{i,0} &= S, & \theta_{j,0} &= \{ \langle b_{j,0}, \hat{\theta}_j \rangle : b_{j,0} \in \Delta(S) \} \\ IS_{i,1} &= S \times \theta_{j,0}, & \theta_{j,1} &= \{ \langle b_{j,1}, \hat{\theta}_j \rangle : b_{j,1} \in \Delta(IS_{j,1}) \} \\ & \;\;\vdots \\ IS_{i,l} &= S \times \theta_{j,l-1}, & \theta_{j,l} &= \{ \langle b_{j,l}, \hat{\theta}_j \rangle : b_{j,l} \in \Delta(IS_{j,l}) \} \end{aligned} \qquad (5)$$

All other remaining components of an I-POMDP are similar to those of a POMDP:

- A = A_i \times A_j is the set of joint actions of all agents;
- \Omega_i is the set of agent i's possible observations;
- T_i : S \times A \times S \to [0, 1] is the state transition function;
- O_i : S \times A \times \Omega_i \to [0, 1] is the observation function;
- R_i : IS_i \times A \to \mathbb{R} is the reward function.

3.2 Interactive belief update

Given all the definitions above, the interactive belief update can be performed as follows:

$$\begin{aligned} b_i^t(is^t) &= Pr(is^t \mid b_i^{t-1}, a_i^{t-1}, o_i^t) \\ &= \alpha \sum_{is^{t-1}} b_i^{t-1}(is^{t-1}) \sum_{a_j^{t-1}} Pr(a_j^{t-1} \mid \theta_j^{t-1}) \, T(s^{t-1}, a^{t-1}, s^t) \, O_i(s^t, a^{t-1}, o_i^t) \\ &\quad \times \sum_{o_j^t} O_j(s^t, a^{t-1}, o_j^t) \, \tau(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t) \end{aligned} \qquad (6)$$

Unlike the plain belief update in a POMDP, the interactive belief update in an I-POMDP takes two additional sophistications into account. First, the probabilities of the other agent's actions given its models (the second summation) must be computed, since the state of the physical environment now depends on both agents' actions. Second, the agent needs to update its belief based on anticipating which observations the other agent might receive and how it updates its own belief (the third summation).

The optimal action, a, for the case of the infinite-horizon criterion with discounting, is then part of the set of optimal actions, OPT(\theta_i), for the belief state, defined as:

$$OPT(\theta_i) = \operatorname*{arg\,max}_{a_i \in A_i} \Big\{ \sum_{is \in IS} b_i(is) \, ER_i(is, a_i) + \gamma \sum_{o_i \in \Omega_i} Pr(o_i \mid a_i, b_i) \, U\big( \langle SE_{\theta_i}(b_i, a_i, o_i), \hat{\theta}_i \rangle \big) \Big\} \qquad (7)$$
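For intuition, the sketch below evaluates the belief update of Eq. (6) exactly for a level-1 belief over a finite set of interactive states (s, m), where m indexes a finite grid of level-0 models of j with a known frame. The opponent policy pi_j (playing the role of Pr(a_j | theta_j)), the model update tau, and the model arrays are caller-supplied assumptions, not the paper's implementation.

```python
def ipomdp_belief_update(b_i, a_i, o_i, pi_j, tau, T, O_i, O_j):
    """Level-1 instance of Eq. (6) over a finite grid of interactive states.

    b_i  : dict {(s, m): prob}; m indexes a finite set of level-0 models of j
    pi_j : pi_j(m) -> action distribution for model m           (assumed given)
    tau  : tau(m, a_j, o_j) -> index of j's updated model        (assumed given)
    T    : T[s, a_i, a_j, s'];  O_i, O_j : O[s', a_i, a_j, o]    (assumed given)
    """
    n_s, n_aj, n_oj = T.shape[0], T.shape[2], O_j.shape[3]
    b_next = {}
    for (s, m), p in b_i.items():
        for a_j in range(n_aj):
            w = p * pi_j(m)[a_j]              # 2nd sum: Pr(a_j | theta_j)
            if w == 0.0:
                continue
            for s2 in range(n_s):
                w2 = w * T[s, a_i, a_j, s2] * O_i[s2, a_i, a_j, o_i]
                for o_j in range(n_oj):       # 3rd sum: anticipate j's observation
                    m2 = tau(m, a_j, o_j)     # j's own belief update (tau)
                    b_next[(s2, m2)] = b_next.get((s2, m2), 0.0) \
                        + w2 * O_j[s2, a_i, a_j, o_j]
    z = sum(b_next.values())                  # alpha: renormalize
    return {k: v / z for k, v in b_next.items()}
```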
4 Decayed Markov Chain Monte Carlo

The original decayed MCMC method was proposed as a filtering algorithm for problems such as dynamic Bayesian networks (Marthi et al. 2002), but it is limited to the single-agent perspective. Here we generalize decayed MCMC to multi-agent settings by applying it to models denoted by interactive dynamic influence diagrams (IDIDs) (Doshi, Zeng, and Chen 2009); Figure 1 shows an example with two time slices.

Figure 1: Two time slices of an IDID; the red nodes are known.

An IDID explicitly models the I-POMDP structure by decomposing it into chance and decision variables and the dependencies between them. It captures the essence of the multi-agent interaction problem under the I-POMDP framework and is thus a convenient representation for computation. Suppose there are two agents i and j; the subscripts in Figure 1 denote the corresponding agents. S is the physical state, O_j is agent j's observation, A_j is j's action, and \theta_j is j's model. The red nodes O_i and A_i are agent i's observation and action, respectively, and they are the only observable variables in this two-agent setting.

In order to utilize Gibbs sampling, we identify the nodes known to agent i and sample all unknown variables from their conditional distributions given everything else. This can be simplified by sampling from the conditionals given the corresponding Markov blankets at the different time steps, since any unknown variable is independent of the others given its Markov blanket. For every t, we then only need to sample from the following conditional distributions:

$$\begin{aligned} o_j^t &\sim Pr(o_j^t \mid s^t, \theta_j^{t-1}, a_j^{t-1}, a_i^{t-1}) \\ s^t &\sim Pr(s^t \mid s^{t-1}, s^{t+1}, a_j^{t-1:t}, a_i^{t-1:t}, o_j^t, o_i^t) \\ a_j^t &\sim Pr(a_j^t \mid \theta_j^t, s^t, s^{t+1}, a_i^t, o_j^{t+1}, o_i^{t+1}) \\ \theta_j^t &\sim Pr(\theta_j^t \mid \theta_j^{t-1}, a_j^{t-1}, o_j^t, a_j^t) \end{aligned} \qquad (8)$$

For a particular time step t, say t = T, since our goal is the filtering distribution of all hidden variables given the observation history O_i^{1:T}, only the latest hidden variables need to be retained and all previous ones can be discarded. We use the final sample of the chain from the previous time step to initialize the new Markov chain for sampling the hidden variables at the current time step, in the hope that the final sample of the previous time step lies near the high-probability region of the next step. The actual algorithm is as follows, where x_u^{(i)} denotes the ith sample of all variables at time u in the network:

Algorithm 1: Decayed MCMC
  Initialize x_{1:t}^{(0)}
  For i = 1, ..., N:
    sample a time step u from the decay function d(·)
    sample x_u^{(i)} from p(x_u | x_{-u}^{(i)}), where x_{-u}^{(i)} denotes all components of x^{(i)} except x_u^{(i)}
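A minimal Python sketch of Algorithm 1, under the assumption that a single-site resampling kernel gibbs_resample (e.g., the Markov-blanket update from the Gibbs sketch in Section 2.2, or the conditionals of Eq. (8)) is supplied by the caller; the decay exponent alpha and the number of iterations are free parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def decay_weights(t, alpha=1.5):
    """Inverse-polynomial decay d(k) proportional to k^(-alpha) over lags
    k = 1..t, normalized into a distribution favoring the recent past."""
    k = np.arange(1, t + 1, dtype=float)
    w = k ** -alpha
    return w / w.sum()

def decayed_mcmc(x, y, gibbs_resample, n_iter=1000, alpha=1.5):
    """Algorithm 1: repeatedly pick a time step from the decay distribution
    and resample that slice conditioned on everything else."""
    t = len(x)
    d = decay_weights(t, alpha)
    for _ in range(n_iter):
        lag = rng.choice(t, p=d)           # lag 0 = most recent time step
        u = t - 1 - lag
        x[u] = gibbs_resample(x, u, y)     # sample x_u | x_{-u}, y
    return x
```

For instance, the single-site Markov-blanket update inside the Gibbs sketch of Section 2.2, refactored into a function of (x, u, y), can serve as gibbs_resample, turning that uniform-index sampler into a decayed one.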
Accordingly, the key to this algorithm is choosing appropriately the time steps at which we sample more frequently. Intuitively, a Markov chain has an exponential forgetting effect: x_{t-k} has an exponentially small effect on p(x_t). So instead of picking the time step uniformly, as in plain Gibbs sampling, we can pick it from a decay function d(k) that decays no faster than the forgetting rate; the resulting sampler favors the recent past while maintaining accuracy. Note that the decay function is essentially the probability of updating state x_{t-k}, so it must satisfy:

1. d(k) > 0 for all k;
2. d(k) decays no faster than the forgetting rate at state x_{t-k}.

Hence an inverse-polynomial decay, d(k) \propto k^{-\alpha} with \alpha > 1, works well for this purpose, since it asymptotically dominates the exponential forgetting rate.

The time complexity is usually analyzed in terms of the cost of each update step of the sampling Markov chain and the mixing time of the chain. The former is simply linear in the size of the interactive state space, but the latter would appear to depend on the time t, since every time step's variables would need to be updated at least once for the whole chain to mix. Since our aim here is to sample accurately from the marginal distribution of x_t, we follow the approach of the original D-MCMC algorithm, which uses the marginal mixing time to ensure that the filtering distribution at the current time step is accurate. It is proved that the marginal mixing time for any given observation sequence is O(1), so the total cost per step, adding the update and mixing time, is O(1) and independent of t.

5 Experiments

5.1 Setup

We present results for the multi-agent tiger game (Gmytrasiewicz and Doshi 2005) with various parameters. The multi-agent tiger game generalizes the classical single-agent tiger game (Kaelbling, Littman, and Cassandra 1998) by adding observations caused by the other agent's actions; the state transition and reward functions also involve the other agent's actions. Specifically, the game goes as follows: a tiger and a pile of gold sit behind two doors, respectively. Each of the two players can listen, hearing the growl of the tiger and the creak caused by the other player, or open a door, which resets the tiger's location with equal probability. The observation accuracies for the tiger and for the other player are both relatively high (0.85 and 0.9, respectively). Regardless of which player causes the outcome, the reward is -1 for listening, -100 for opening the tiger door, and 10 for opening the gold door.

Recall that an interactive POMDP of agent i is defined as the six-tuple I-POMDP_i = \langle IS_{i,l}, A, \Omega_i, T_i, O_i, R_i \rangle. For the specific setting of the multi-agent tiger problem:

- IS_{i,1} = S \times \theta_{j,0}, where S = {tiger on the left (TL), tiger on the right (TR)} and \theta_{j,0} = b_{j,0} = {p(TL), p(TR)}, assuming j's frame is known.
- A = A_i \times A_j is all combinations of each agent's possible actions: listen (L), open left door (OL), and open right door (OR).
- \Omega_i is all combinations of each agent's possible observations: growl from left (GL) or right (GR), combined with creak from left (CL), right (CR), or silence (S).
- T_i = T_j : S \times A_i \times A_j \times S \to [0, 1] is now a joint state transition probability involving both actions: the tiger's position is reset to the left/right door with probability 0.5/0.5 whenever an agent opens a door, and remains unchanged with probability 1 when both listen.
- O_i : S \times A_i \times A_j \times \Omega_i \to [0, 1] becomes a joint observation probability involving both actions; agent i's observation accuracy is the accuracy of hearing a growl (0.85) times the accuracy of hearing a creak (0.9). O_j is symmetric to O_i in terms of the joint actions.
- R_i : IS \times A_i \times A_j \to \mathbb{R}: agent i receives -1, -100, and 10 when he listens, opens the wrong door, and opens the correct door, respectively, independent of j's actions, and symmetrically for agent j.

A concrete array encoding of this model is sketched below.
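To pin down the setup, this sketch assembles the joint transition, observation, and reward tables described above. The index ordering, the handling of silence, and the split of the residual creak probability over the two wrong creaks are our own encoding choices, not taken from the paper.

```python
import numpy as np

S = ["TL", "TR"]                 # tiger behind left / right door
A = ["L", "OL", "OR"]            # listen, open left, open right
GROWL, CREAK = ["GL", "GR"], ["CL", "CR", "S"]

# Joint transition T[s, a_i, a_j, s']: any door opening resets the tiger
# to each side with probability 0.5; if both listen, the state persists.
T = np.zeros((2, 3, 3, 2))
for ai in range(3):
    for aj in range(3):
        T[:, ai, aj, :] = np.eye(2) if (ai == 0 and aj == 0) else 0.5

# Agent i's observation O[s', a_i, a_j, growl, creak]: growl accuracy 0.85
# about the tiger when i listens (uninformative otherwise, an assumption),
# creak accuracy 0.9 about whether j opened a door.
O_i = np.zeros((2, 3, 3, 2, 3))
creak_truth = {0: 2, 1: 0, 2: 1}                 # L -> silence, OL -> CL, OR -> CR
for sp in range(2):
    for ai in range(3):
        for aj in range(3):
            for g in range(2):
                pg = (0.85 if g == sp else 0.15) if ai == 0 else 0.5
                for c in range(3):
                    pc = 0.9 if c == creak_truth[aj] else 0.05  # 0.1 over 2 wrong creaks
                    O_i[sp, ai, aj, g, c] = pg * pc

# Reward R[s, a_i]: -1 listen, -100 tiger door, +10 gold door (independent of a_j).
R_i = np.array([[-1.0, -100.0, 10.0],   # TL: OL hits the tiger, OR finds the gold
                [-1.0, 10.0, -100.0]])  # TR: OR hits the tiger, OL finds the gold

assert np.allclose(T.sum(-1), 1) and np.allclose(O_i.sum((-2, -1)), 1)
```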
For the sake of brevity, we restrict the experiments to a two-agent setting and a nesting level of one, but the sampling algorithm extends to any number of agents and nesting levels in a straightforward way. For the actual experiments, we first fix the number of samples at 1000 and run a two-agent tiger game simulation as described above, to compare the accuracy and efficiency of D-MCMC and I-PF. We then vary the probability that the tiger is reset when a door is opened and the agents' observation accuracies, in order to test the mixing time of D-MCMC under different problem settings. Lastly, we vary the sample size and compare the total error rate of predicting the other agent's actions, to show the consistency of the two methods. The experiments were run on a 64-bit Windows 10 computer with an Intel Core i5-6200U 2.3GHz CPU, 8GB memory, and Matlab R2015b.

5.2 Results

Figure 2: Accuracy and efficiency comparisons.

The left plot in Figure 2 shows the prediction error for agent j's actions over time steps when i's observation accuracy is increased to 0.95 and every other parameter remains the same as in Section 5.1. We see that the errors remain bounded for both sampling algorithms, but I-PF can sometimes lose track of the actual belief (e.g., around time steps 35-40) when i's observation function is nearly deterministic, since the agent then tends to become overly sure and the samples collapse. Meanwhile, D-MCMC does not suffer from this problem in this situation.

The right plot in Figure 2 is a running-time comparison under the standard tiger game settings described in Section 5.1. Although the running time of both methods is independent of t, when the two methods use an equal number of samples (1000), I-PF is slightly more efficient than D-MCMC due to its recursive nature: the samples drawn at the previous time step are reused at the next. In contrast, D-MCMC is non-recursive, consults part of the history, and also needs some mixing time at each time step before the samples from the sampling Markov chain can actually be used. However, their running times differ by at most a constant factor, so D-MCMC is still the better choice when dealing with observation outliers.

Figure 3 shows some important intrinsic properties of D-MCMC on I-POMDPs. The left plot gives the mixing time of D-MCMC over the observation history. For tiger games with different transition functions (the tiger remains in place with probability 0.5 or 0.9 when a door opens) and observation functions (hearing accuracy of 0.65 or 0.85), this experiment shows that the mixing time of D-MCMC increases with transition determinism and decreases with observation determinism, since the increasing importance of the observations makes the history less relevant. In all settings, the mixing time remains bounded as the time step increases.

The right plot in Figure 3 shows the bounded total error rates (consistency) under the standard tiger game settings as the number of samples increases. We observed that the error rate of D-MCMC drops faster than that of I-PF until the two become very close after roughly 1000 samples, and D-MCMC retains a slightly lower error rate than I-PF due to its better performance when confronting outliers.
Figure 3: Mixing time and consistency comparisons.

Intuitively, D-MCMC consults both the history and the current observation, and thus receives more information, which leads to higher accuracy.

Lastly, Table 1 gives a detailed comparison of important aspects of D-MCMC and I-PF.

Table 1: Comparison of major differences between I-PF and D-MCMC

| | D-MCMC | I-PF |
|---|---|---|
| Samples | entire state trajectories | most recent state |
| Resampling | resamples the state at any time | only resamples the current state |
| Recursion | non-recursive | recursive |
| Divergence | non-divergent | sometimes divergent |
| Major drawbacks | incapable of handling determinism; slightly slower | inefficient in high dimensions; errors propagate forward |

To summarize, when dealing with a complicated problem that may generate frequent observation outliers, D-MCMC is guaranteed to be non-divergent, while I-PF may temporarily lose track of the real posterior distribution. A complex, high-dimensional state space or a tight observation model can be telling signs of such scenarios. However, there is a trade-off between accuracy and efficiency, since D-MCMC handles these scenarios at the cost of sampling from the recent history and also consumes a small amount of time for the burn-in period.

6 Conclusions and Future Work

We have described a new method to approximate the belief update in I-POMDP settings and used it to sample from the interactive beliefs. The results show that our approach mitigates the belief space complexity, handles observation outliers, and competes with other sampling algorithms in terms of the accuracy of predicting the other agent's actions. Although the empirical results show that D-MCMC is non-divergent, in future work we will formally prove its convergence on I-POMDPs. More examples with high-dimensional state spaces can also be tested; in such settings D-MCMC should outperform I-PF, since the latter suffers long recovery times using the state transition model as the dimensionality increases.

References

Doshi, P., and Gmytrasiewicz, P. J. 2009. Monte Carlo sampling methods for approximating interactive POMDPs. Journal of Artificial Intelligence Research 34.

Doshi-Velez, F., and Konidaris, G. 2013. Hidden parameter Markov decision processes: A semiparametric regression approach for discovering latent task parametrizations. arXiv preprint.

Doshi, P., Zeng, Y., and Chen, Q. 2009. Graphical models for interactive POMDPs: representations and solutions. Autonomous Agents and Multi-Agent Systems 18(3).
Gmytrasiewicz, P. J., and Doshi, P. 2005. A framework for sequential planning in multi-agent settings. Journal of Artificial Intelligence Research 24(1).

Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. 1996. Introducing Markov chain Monte Carlo. In Markov Chain Monte Carlo in Practice.

Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. 1998. Planning and acting in partially observable stochastic domains. Artificial Intelligence 101(1).

Liu, M., Liao, X., and Carin, L. 2011. The infinite regionalized policy representation. In Proceedings of the 28th International Conference on Machine Learning (ICML-11).

Marthi, B., Pasula, H., Russell, S., and Peres, Y. 2002. Decayed MCMC filtering. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence.

Panella, A., and Gmytrasiewicz, P. 2016. Bayesian learning of other agents' finite controllers for interactive POMDPs. In Thirtieth AAAI Conference on Artificial Intelligence.

Papadimitriou, C. H., and Tsitsiklis, J. N. 1987. The complexity of Markov decision processes. Mathematics of Operations Research 12(3).

Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.

Ross, S., Chaib-draa, B., and Pineau, J. 2007. Bayes-adaptive POMDPs. In Advances in Neural Information Processing Systems.