Decayed Markov Chain Monte Carlo for Interactive POMDPs

Yanlin Han, Piotr Gmytrasiewicz
Department of Computer Science, University of Illinois at Chicago, Chicago, IL

Abstract

To act optimally in a partially observable, stochastic, multi-agent environment, an autonomous agent needs to maintain a belief about the world at any given time. An extension of partially observable Markov decision processes (POMDPs), called interactive POMDPs (I-POMDPs), provides a principled framework for planning and acting in such settings. I-POMDPs augment POMDP beliefs by including models of the other agents in the state space, forming a hierarchical belief structure that represents an agent's belief about the physical state, its belief about the other agents, and their beliefs about others' beliefs. This nested hierarchy results in a dramatic increase in belief space complexity. In order to perform belief updates in such settings, we propose a new approximation method that utilizes decayed Markov chain Monte Carlo (D-MCMC). For problems of various complexities, we show that our approach effectively mitigates the belief space complexity and competes with other Monte Carlo sampling algorithms for multi-agent systems, such as the interactive particle filter (I-PF). We also compare their accuracy and efficiency, and suggest applicable scenarios for each algorithm.

1 Introduction

Partially observable Markov decision processes (POMDPs) (Kaelbling, Littman, and Cassandra 1998) provide a principled, decision-theoretic framework for planning under uncertainty in a partially observable, stochastic environment. An autonomous agent operates rationally in such settings by maintaining a belief over the physical state at any given time and sequentially choosing the optimal actions that maximize future rewards. Solutions of POMDPs are therefore mappings from an agent's beliefs to actions. Although POMDPs can be used in multi-agent settings, they do so under the strong assumption that the effects of other agents' actions can be treated implicitly as noise and folded into the state transitions, as in recent Bayes-adaptive POMDPs (Ross, Chaib-draa, and Pineau 2007), the infinite regionalized policy representation (Liu, Liao, and Carin 2011), and infinite POMDPs (Doshi-Velez et al. 2013). Thus, an agent's beliefs about other agents are not part of POMDP solutions.

Interactive POMDPs (I-POMDPs) (Gmytrasiewicz and Doshi 2005) generalize POMDPs to multi-agent settings by replacing POMDP belief spaces with interactive hierarchical belief systems. Specifically, they augment the plain beliefs about physical states in a POMDP by including models of other agents in the state space, which forms a hierarchical belief structure. The models of other agents included in the augmented state space are of two types: intentional models, which ascribe beliefs, preferences, and rationality to other agents (Gmytrasiewicz and Doshi 2005), and subintentional models, such as finite state controllers (Panella and Gmytrasiewicz 2016), which do not. We focus on intentional models in this paper. Solutions of I-POMDPs map an agent's belief about the environment and the other agents' models to actions, making the framework applicable to agent, human, and mixed agent-human applications.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
It has been clearly shown (Gmytrasiewicz and Doshi 2005) that the added sophistication of modeling others as rational agents results in a higher value function, dominating the one obtained by simply treating others as noise. However, for I-POMDPs, the interactive belief augmentation results in a dramatic increase in belief space complexity, adding to the curse of dimensionality: the complexity of the belief representation is proportional to the dimensionality of the beliefs, which grows exponentially with the number of agent models as the nesting level increases. Since exact solutions to POMDPs are proven to be PSPACE-complete for finite horizons and undecidable for infinite horizons (Papadimitriou and Tsitsiklis 1987), the time complexity of general I-POMDPs, which may contain multiple POMDPs and I-POMDPs depending on the actual nesting level, is at least PSPACE-complete for finite horizons and undecidable for infinite horizons. Therefore, in order to apply I-POMDPs to more realistic settings, a good approximation algorithm for computing the nested interactive beliefs is crucial to the trade-off between solution quality and computation.

To address this issue, we propose methods that use Monte Carlo sampling to obtain approximate solutions to I-POMDPs. Specifically, we use decayed Markov chain Monte Carlo (D-MCMC) (Marthi et al. 2002) to concentrate sampling of beliefs on their recent past, and the interactive particle filter (I-PF) (Doshi and Gmytrasiewicz 2009) to descend the belief hierarchy and sample at each level. Applying these methods to I-POMDPs is nontrivial: it requires generalizing D-MCMC to multi-agent settings with guaranteed convergence, and improving I-PF for the infinite time horizon and for the modeled agent of interest. Our method significantly mitigates the belief space complexity of I-POMDPs and works effectively and efficiently on various multi-agent problem settings. Compared with other sampling methods for I-POMDPs such as I-PF, the generalized D-MCMC is competitive, non-divergent, and advantageous in certain application scenarios.

2 Background

2.1 POMDP

A partially observable Markov decision process (POMDP) (Kaelbling, Littman, and Cassandra 1998) is a general model for planning and acting in a single-agent, partially observable, stochastic domain. It is defined for a single agent i as:

$$POMDP_i = \langle S, A_i, \Omega_i, T_i, O_i, R_i \rangle \qquad (1)$$

where S is the set of states of the environment; A_i is the set of agent i's possible actions; \Omega_i is the set of agent i's possible observations; T_i : S \times A_i \times S \to [0, 1] is the state transition function; O_i : S \times A_i \times \Omega_i \to [0, 1] is the observation function; and R_i : S \times A_i \to \mathbb{R} is the reward function.

An agent's belief about the state can be represented as a probability distribution over S. The belief update is done using the following formula, where \alpha is a normalizing constant:

$$b'(s') = \alpha \, O(s', a, o') \sum_{s \in S} T(s, a, s') \, b(s) \qquad (2)$$

The optimal action, a, is then part of the set of optimal actions, OPT(b_i), for the belief state, defined as:

$$OPT(b_i) = \operatorname*{arg\,max}_{a_i \in A_i} \Big\{ \sum_{s \in S} b_i(s) R(s, a_i) + \gamma \sum_{o_i \in \Omega_i} Pr(o_i \mid a_i, b_i) \, U\big(SE(b_i, a_i, o_i)\big) \Big\} \qquad (3)$$

where SE(b_i, a_i, o_i) denotes the belief update of Eq. (2) and U is the value function.
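To make Eq. (2) concrete, here is a minimal sketch of the discrete belief update with the transition and observation models stored as arrays. The tiger-game numbers below are illustrative placeholders, not the paper's exact matrices.

```python
import numpy as np

def pomdp_belief_update(b, a, o, T, O):
    """Eq. (2): b'(s') = alpha * O[s', a, o] * sum_s T[s, a, s'] * b(s).

    b : (|S|,) prior belief; T : (|S|, |A|, |S|) transition probabilities;
    O : (|S|, |A|, |Omega|) observation probabilities.
    """
    b_next = O[:, a, o] * (b @ T[:, a, :])  # likelihood times predicted belief
    return b_next / b_next.sum()            # alpha: renormalize

# Toy single-agent tiger model (illustrative): states {TL, TR},
# actions {listen, open-left, open-right}, observations {GL, GR}.
T = np.zeros((2, 3, 2))
T[:, 0, :] = np.eye(2)           # listening leaves the tiger where it is
T[:, 1, :] = T[:, 2, :] = 0.5    # opening any door resets the tiger uniformly
O = np.full((2, 3, 2), 0.5)      # observations are uninformative after opening
O[0, 0, 0] = O[1, 0, 1] = 0.85   # growl heard correctly w.p. 0.85 when listening
O[0, 0, 1] = O[1, 0, 0] = 0.15

b = np.array([0.5, 0.5])
print(pomdp_belief_update(b, a=0, o=0, T=T, O=O))  # belief shifts toward TL
```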
2.2 Markov Chain Monte Carlo

The Markov chain Monte Carlo (MCMC) method (Gilks, Richardson, and Spiegelhalter 1996) is widely used to approximate probability distributions that cannot be computed directly. It generates samples from a posterior distribution \pi(x) over a state space by simulating a Markov chain p(x' \mid x) whose state space is that of x and whose stationary distribution is \pi(x). The samples drawn from p converge to the target distribution \pi as the number of samples goes to infinity. A Markov chain with the appropriate stationary distribution can be constructed using Gibbs sampling (Pearl 1988), which works on a target distribution \pi(x) with x = (x_1, ..., x_t): for a Markov chain p(x' \mid x), in each iteration we pick i \in \{1, ..., t\} and sample x_i from its conditional distribution \pi(x_i \mid x_j : j \neq i); the stationary distribution of p is then \pi.
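As a concrete illustration of this construction (our own sketch, not code from the paper), the following Gibbs sampler resamples a hidden binary Markov chain x_1, ..., x_t given noisy observations, choosing the index i uniformly; the conditional of each x_i depends only on its Markov blanket. D-MCMC, introduced below, changes only how i is chosen.

```python
import numpy as np

rng = np.random.default_rng(0)
stay, acc, t = 0.9, 0.8, 50   # P(x_k = x_{k-1}), P(y_k = x_k), chain length

def gibbs_sweep(y, n_iter=5000):
    """Plain Gibbs over a hidden binary chain given observations y."""
    x = rng.integers(0, 2, size=t)              # arbitrary initial trajectory
    trans = lambda a, b: stay if a == b else 1 - stay
    obs = lambda s, o: acc if s == o else 1 - acc
    for _ in range(n_iter):
        i = rng.integers(t)                     # uniform index choice
        w = np.ones(2)
        for v in (0, 1):                        # conditional given Markov blanket
            if i > 0:
                w[v] *= trans(x[i - 1], v)
            if i < t - 1:
                w[v] *= trans(v, x[i + 1])
            w[v] *= obs(v, y[i])
        x[i] = rng.random() < w[1] / w.sum()    # sample x_i from pi(x_i | x_{-i})
    return x

y = rng.integers(0, 2, size=t)                  # synthetic observations for the demo
print(gibbs_sweep(y))
```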
3 The Model

3.1 I-POMDP framework

An interactive POMDP of agent i, I-POMDP_i, is defined as:

$$I\text{-}POMDP_i = \langle IS_{i,l}, A, \Omega_i, T_i, O_i, R_i \rangle \qquad (4)$$

where IS_{i,l} is the set of interactive states of the environment, defined as IS_{i,l} = S \times M_{j,l-1} for l \geq 1; here S is the set of states, M_{j,l-1} is the set of possible models of agent j, and l is the strategy level. A specific class of models are the (l-1)th-level intentional models, \Theta_{j,l-1}, of agent j: \theta_{j,l-1} = \langle b_{j,l-1}, A, \Omega_j, T_j, O_j, R_j, OC_j \rangle, where b_{j,l-1} is agent j's belief nested to level l-1, b_{j,l-1} \in \Delta(IS_{j,l-1}), and OC_j is j's optimality criterion. The intentional model can be rewritten as \theta_{j,l-1} = \langle b_{j,l-1}, \hat{\theta}_j \rangle, where the frame \hat{\theta}_j collects all elements of the intentional model other than the belief. IS_{i,l} can be defined inductively (note that when the frame \hat{\theta}_j is known, the model \theta_j reduces to the belief b_j):

$$\begin{aligned} IS_{i,0} &= S, & \theta_{j,0} &= \{ \langle b_{j,0}, \hat{\theta}_j \rangle : b_{j,0} \in \Delta(S) \} \\ IS_{i,1} &= S \times \theta_{j,0}, & \theta_{j,1} &= \{ \langle b_{j,1}, \hat{\theta}_j \rangle : b_{j,1} \in \Delta(IS_{j,1}) \} \\ & \;\;\vdots \\ IS_{i,l} &= S \times \theta_{j,l-1}, & \theta_{j,l} &= \{ \langle b_{j,l}, \hat{\theta}_j \rangle : b_{j,l} \in \Delta(IS_{j,l}) \} \end{aligned} \qquad (5)$$

All other remaining components of an I-POMDP are similar to those of a POMDP:

- A = A_i \times A_j is the set of joint actions of all agents;
- \Omega_i is the set of agent i's possible observations;
- T_i : S \times A \times S \to [0, 1] is the state transition function;
- O_i : S \times A \times \Omega_i \to [0, 1] is the observation function;
- R_i : IS_i \times A \to \mathbb{R} is the reward function.

3.2 Interactive belief update

Given all the definitions above, the interactive belief update can be performed as follows:

$$\begin{aligned} b_i^t(is^t) &= Pr(is^t \mid b_i^{t-1}, a_i^{t-1}, o_i^t) \\ &= \alpha \sum_{is^{t-1}} b_i^{t-1}(is^{t-1}) \sum_{a_j^{t-1}} Pr(a_j^{t-1} \mid \theta_j^{t-1}) \, T(s^{t-1}, a^{t-1}, s^t) \, O_i(s^t, a^{t-1}, o_i^t) \\ &\quad \times \sum_{o_j^t} O_j(s^t, a^{t-1}, o_j^t) \, \tau(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t) \end{aligned} \qquad (6)$$

Unlike the plain belief update in a POMDP, the interactive belief update in an I-POMDP takes two additional sophistications into account. First, the probabilities of the other agent's actions given its models (the second summation) must be computed, since the state of the physical environment now depends on both agents' actions. Second, the agent needs to update its belief based on anticipating which observations the other agent might receive and how it updates its own belief (the third summation).

The optimal action, a, for the case of the infinite-horizon criterion with discounting, is then part of the set of optimal actions, OPT(\theta_i), for the belief state, defined as:

$$OPT(\theta_i) = \operatorname*{arg\,max}_{a_i \in A_i} \Big\{ \sum_{is \in IS} b_i(is) \, ER_i(is, a_i) + \gamma \sum_{o_i \in \Omega_i} Pr(o_i \mid a_i, b_i) \, U\big( \langle SE_{\theta_i}(b_i, a_i, o_i), \hat{\theta}_i \rangle \big) \Big\} \qquad (7)$$
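For intuition, the sketch below evaluates the belief update of Eq. (6) exactly for a level-1 belief over a finite set of interactive states (s, m), where m indexes a finite grid of level-0 models of j with a known frame. The opponent policy pi_j (playing the role of Pr(a_j | theta_j)), the model update tau, and the model arrays are caller-supplied assumptions, not the paper's implementation.

```python
def ipomdp_belief_update(b_i, a_i, o_i, pi_j, tau, T, O_i, O_j):
    """Level-1 instance of Eq. (6) over a finite grid of interactive states.

    b_i  : dict {(s, m): prob}; m indexes a finite set of level-0 models of j
    pi_j : pi_j(m) -> action distribution for model m           (assumed given)
    tau  : tau(m, a_j, o_j) -> index of j's updated model        (assumed given)
    T    : T[s, a_i, a_j, s'];  O_i, O_j : O[s', a_i, a_j, o]    (assumed given)
    """
    n_s, n_aj, n_oj = T.shape[0], T.shape[2], O_j.shape[3]
    b_next = {}
    for (s, m), p in b_i.items():
        for a_j in range(n_aj):
            w = p * pi_j(m)[a_j]              # 2nd sum: Pr(a_j | theta_j)
            if w == 0.0:
                continue
            for s2 in range(n_s):
                w2 = w * T[s, a_i, a_j, s2] * O_i[s2, a_i, a_j, o_i]
                for o_j in range(n_oj):       # 3rd sum: anticipate j's observation
                    m2 = tau(m, a_j, o_j)     # j's own belief update (tau)
                    b_next[(s2, m2)] = b_next.get((s2, m2), 0.0) \
                        + w2 * O_j[s2, a_i, a_j, o_j]
    z = sum(b_next.values())                  # alpha: renormalize
    return {k: v / z for k, v in b_next.items()}
```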
4 Decayed Markov Chain Monte Carlo

The original decayed MCMC method was proposed as a filtering algorithm for problems such as dynamic Bayesian networks (Marthi et al. 2002), but it is limited to the single-agent perspective. Here we generalize decayed MCMC to multi-agent settings by applying it to models denoted by interactive dynamic influence diagrams (IDIDs) (Doshi, Zeng, and Chen 2009); Figure 1 shows an example with two time slices.

Figure 1: Two time slices of an IDID; the red nodes are known.

An IDID explicitly models the I-POMDP structure by decomposing it into chance and decision variables and the dependencies between them. It captures the essence of the multi-agent interaction problem under the I-POMDP framework and is thus a convenient representation for computation. Suppose there are two agents i and j; the subscripts in Figure 1 denote the corresponding agents. S is the physical state, O_j is agent j's observation, A_j is j's action, and \theta_j is j's model. The red nodes O_i and A_i are agent i's observation and action, respectively, and they are the only observable variables in this two-agent setting.

In order to utilize Gibbs sampling, we identify the nodes known to agent i and sample all unknown variables from their conditional distributions given everything else. This can be simplified by sampling from the conditionals given the corresponding Markov blankets at the different time steps, since any unknown variable is independent of the others given its Markov blanket. For every t, we then only need to sample from the following conditional distributions:

$$\begin{aligned} o_j^t &\sim Pr(o_j^t \mid s^t, \theta_j^{t-1}, a_j^{t-1}, a_i^{t-1}) \\ s^t &\sim Pr(s^t \mid s^{t-1}, s^{t+1}, a_j^{t-1:t}, a_i^{t-1:t}, o_j^t, o_i^t) \\ a_j^t &\sim Pr(a_j^t \mid \theta_j^t, s^t, s^{t+1}, a_i^t, o_j^{t+1}, o_i^{t+1}) \\ \theta_j^t &\sim Pr(\theta_j^t \mid \theta_j^{t-1}, a_j^{t-1}, o_j^t, a_j^t) \end{aligned} \qquad (8)$$

For a particular time step t, say t = T, since our goal is the filtering distribution of all hidden variables given the observation history O_i^{1:T}, only the latest hidden variables need to be retained and all previous ones can be discarded. We use the final sample of the chain from the previous time step to initialize the new Markov chain for sampling the hidden variables at the current time step, in the hope that the final sample of the previous time step lies near the high-probability region of the next step. The actual algorithm is as follows, where x_u^{(i)} denotes the ith sample of all variables at time u in the network:

Algorithm 1: Decayed MCMC
  Initialize x_{1:t}^{(0)}
  For i = 1, ..., N:
    sample a time step u from the decay function d(·)
    sample x_u^{(i)} from p(x_u | x_{-u}^{(i)}), where x_{-u}^{(i)} denotes all components of x^{(i)} except x_u^{(i)}
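A minimal Python sketch of Algorithm 1, under the assumption that a single-site resampling kernel gibbs_resample (e.g., the Markov-blanket update from the Gibbs sketch in Section 2.2, or the conditionals of Eq. (8)) is supplied by the caller; the decay exponent alpha and the number of iterations are free parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def decay_weights(t, alpha=1.5):
    """Inverse-polynomial decay d(k) proportional to k^(-alpha) over lags
    k = 1..t, normalized into a distribution favoring the recent past."""
    k = np.arange(1, t + 1, dtype=float)
    w = k ** -alpha
    return w / w.sum()

def decayed_mcmc(x, y, gibbs_resample, n_iter=1000, alpha=1.5):
    """Algorithm 1: repeatedly pick a time step from the decay distribution
    and resample that slice conditioned on everything else."""
    t = len(x)
    d = decay_weights(t, alpha)
    for _ in range(n_iter):
        lag = rng.choice(t, p=d)           # lag 0 = most recent time step
        u = t - 1 - lag
        x[u] = gibbs_resample(x, u, y)     # sample x_u | x_{-u}, y
    return x
```

For instance, the single-site Markov-blanket update inside the Gibbs sketch of Section 2.2, refactored into a function of (x, u, y), can serve as gibbs_resample, turning that uniform-index sampler into a decayed one.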
Accordingly, the key to this algorithm is choosing appropriately the time steps at which we sample more frequently. Intuitively, a Markov chain has an exponential forgetting effect: x_{t-k} has an exponentially small effect on p(x_t). So instead of picking the time step uniformly, as in plain Gibbs sampling, we can pick it from a decay function d(k) that decays no faster than the forgetting rate; the resulting sampler favors the recent past while maintaining accuracy. Note that the decay function is essentially the probability of updating state x_{t-k}, so it must satisfy:

1. d(k) > 0 for all k;
2. d(k) decays no faster than the forgetting rate at state x_{t-k}.

Hence an inverse-polynomial decay, d(k) \propto k^{-\alpha} with \alpha > 1, works well for this purpose, since it asymptotically dominates the exponential forgetting rate.

The time complexity is usually analyzed in terms of the cost of each update step of the sampling Markov chain and the mixing time of the chain. The former is simply linear in the size of the interactive state space, but the latter would appear to depend on the time t, since every time step's variables would need to be updated at least once for the whole chain to mix. Since our aim here is to sample accurately from the marginal distribution of x_t, we follow the approach of the original D-MCMC algorithm, which uses the marginal mixing time to ensure that the filtering distribution at the current time step is accurate. It is proved that the marginal mixing time for any given observation sequence is O(1), so the total cost per step, adding the update and mixing time, is O(1) and independent of t.

5 Experiments

5.1 Setup

We present results for the multi-agent tiger game (Gmytrasiewicz and Doshi 2005) with various parameters. The multi-agent tiger game generalizes the classical single-agent tiger game (Kaelbling, Littman, and Cassandra 1998) by adding observations caused by the other agent's actions; the state transition and reward functions also involve the other agent's actions. Specifically, the game goes as follows: a tiger and a pile of gold sit behind two doors, respectively. Each of the two players can listen, hearing the growl of the tiger and the creak caused by the other player, or open a door, which resets the tiger's location with equal probability. The observation accuracies for the tiger and for the other player are both relatively high (0.85 and 0.9, respectively). Regardless of which player causes the outcome, the reward is -1 for listening, -100 for opening the tiger door, and 10 for opening the gold door.

Recall that an interactive POMDP of agent i is defined as the six-tuple I-POMDP_i = \langle IS_{i,l}, A, \Omega_i, T_i, O_i, R_i \rangle. For the specific setting of the multi-agent tiger problem:

- IS_{i,1} = S \times \theta_{j,0}, where S = {tiger on the left (TL), tiger on the right (TR)} and \theta_{j,0} = b_{j,0} = {p(TL), p(TR)}, assuming j's frame is known.
- A = A_i \times A_j is all combinations of each agent's possible actions: listen (L), open left door (OL), and open right door (OR).
- \Omega_i is all combinations of each agent's possible observations: growl from left (GL) or right (GR), combined with creak from left (CL), right (CR), or silence (S).
- T_i = T_j : S \times A_i \times A_j \times S \to [0, 1] is now a joint state transition probability involving both actions: the tiger's position is reset to the left/right door with probability 0.5/0.5 whenever an agent opens a door, and remains unchanged with probability 1 when both listen.
- O_i : S \times A_i \times A_j \times \Omega_i \to [0, 1] becomes a joint observation probability involving both actions; agent i's observation accuracy is the accuracy of hearing a growl (0.85) times the accuracy of hearing a creak (0.9). O_j is symmetric to O_i in terms of the joint actions.
- R_i : IS \times A_i \times A_j \to \mathbb{R}: agent i receives -1, -100, and 10 when he listens, opens the wrong door, and opens the correct door, respectively, independent of j's actions, and symmetrically for agent j.

A concrete array encoding of this model is sketched below.
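To pin down the setup, this sketch assembles the joint transition, observation, and reward tables described above. The index ordering, the handling of silence, and the split of the residual creak probability over the two wrong creaks are our own encoding choices, not taken from the paper.

```python
import numpy as np

S = ["TL", "TR"]                 # tiger behind left / right door
A = ["L", "OL", "OR"]            # listen, open left, open right
GROWL, CREAK = ["GL", "GR"], ["CL", "CR", "S"]

# Joint transition T[s, a_i, a_j, s']: any door opening resets the tiger
# to each side with probability 0.5; if both listen, the state persists.
T = np.zeros((2, 3, 3, 2))
for ai in range(3):
    for aj in range(3):
        T[:, ai, aj, :] = np.eye(2) if (ai == 0 and aj == 0) else 0.5

# Agent i's observation O[s', a_i, a_j, growl, creak]: growl accuracy 0.85
# about the tiger when i listens (uninformative otherwise, an assumption),
# creak accuracy 0.9 about whether j opened a door.
O_i = np.zeros((2, 3, 3, 2, 3))
creak_truth = {0: 2, 1: 0, 2: 1}                 # L -> silence, OL -> CL, OR -> CR
for sp in range(2):
    for ai in range(3):
        for aj in range(3):
            for g in range(2):
                pg = (0.85 if g == sp else 0.15) if ai == 0 else 0.5
                for c in range(3):
                    pc = 0.9 if c == creak_truth[aj] else 0.05  # 0.1 over 2 wrong creaks
                    O_i[sp, ai, aj, g, c] = pg * pc

# Reward R[s, a_i]: -1 listen, -100 tiger door, +10 gold door (independent of a_j).
R_i = np.array([[-1.0, -100.0, 10.0],   # TL: OL hits the tiger, OR finds the gold
                [-1.0, 10.0, -100.0]])  # TR: OR hits the tiger, OL finds the gold

assert np.allclose(T.sum(-1), 1) and np.allclose(O_i.sum((-2, -1)), 1)
```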
For the sake of brevity, we restrict the experiments to a two-agent setting and a nesting level of one, but the sampling algorithm extends to any number of agents and nesting levels in a straightforward way. For the actual experiments, we first fix the number of samples at 1000 and run a two-agent tiger game simulation as described above, to compare the accuracy and efficiency of D-MCMC and I-PF. We then vary the probability that the tiger is reset when a door is opened and the agents' observation accuracies, in order to test the mixing time of D-MCMC under different problem settings. Lastly, we vary the sample size and compare the total error rate of predicting the other agent's actions, to show the consistency of the two methods. The experiments were run on a 64-bit Windows 10 computer with an Intel Core i5-6200U 2.3GHz CPU, 8GB memory, and Matlab R2015b.

5.2 Results

Figure 2: Accuracy and efficiency comparisons.

The left plot in Figure 2 shows the prediction error for agent j's actions over time steps when i's observation accuracy is increased to 0.95 and every other parameter remains the same as in Section 5.1. We see that the errors remain bounded for both sampling algorithms, but I-PF can sometimes lose track of the actual belief (e.g., around time steps 35-40) when i's observation function is nearly deterministic, since the agent then tends to become overly sure and the samples collapse. Meanwhile, D-MCMC does not suffer from this problem in this situation.

The right plot in Figure 2 is a running-time comparison under the standard tiger game settings described in Section 5.1. Although the running time of both methods is independent of t, when the two methods use an equal number of samples (1000), I-PF is slightly more efficient than D-MCMC due to its recursive nature: the samples drawn at the previous time step are reused at the next. In contrast, D-MCMC is non-recursive, consults part of the history, and also needs some mixing time at each time step before the samples from the sampling Markov chain can actually be used. However, their running times differ by at most a constant factor, so D-MCMC is still the better choice when dealing with observation outliers.

Figure 3 shows some important intrinsic properties of D-MCMC on I-POMDPs. The left plot gives the mixing time of D-MCMC over the observation history. For tiger games with different transition functions (the tiger remains in place with probability 0.5 or 0.9 when a door opens) and observation functions (hearing accuracy of 0.65 or 0.85), this experiment shows that the mixing time of D-MCMC increases with transition determinism and decreases with observation determinism, since the increasing importance of the observations makes the history less relevant. In all settings, the mixing time remains bounded as the time step increases.

The right plot in Figure 3 shows the bounded total error rates (consistency) under the standard tiger game settings as the number of samples increases. We observed that the error rate of D-MCMC drops faster than that of I-PF until the two become very close after roughly 1000 samples, and D-MCMC retains a slightly lower error rate than I-PF due to its better performance when confronting outliers.
Figure 3: Mixing time and consistency comparisons.

Intuitively, D-MCMC consults both the history and the current observation, and thus receives more information, which leads to higher accuracy.

Lastly, Table 1 gives a detailed comparison of important aspects of D-MCMC and I-PF.

Table 1: Comparison of major differences between I-PF and D-MCMC

| | D-MCMC | I-PF |
|---|---|---|
| Samples | entire state trajectories | most recent state |
| Resampling | resamples the state at any time | only resamples the current state |
| Recursion | non-recursive | recursive |
| Divergence | non-divergent | sometimes divergent |
| Major drawbacks | incapable of handling determinism; slightly slower | inefficient in high dimensions; errors propagate forward |

To summarize, when dealing with a complicated problem that may generate frequent observation outliers, D-MCMC is guaranteed to be non-divergent, while I-PF may temporarily lose track of the real posterior distribution. A complex, high-dimensional state space or a tight observation model can be telling signs of such scenarios. However, there is a trade-off between accuracy and efficiency, since D-MCMC handles these scenarios at the cost of sampling from the recent history and also consumes a small amount of time for the burn-in period.

6 Conclusions and Future Work

We have described a new method to approximate the belief update in I-POMDP settings and used it to sample from the interactive beliefs. The results show that our approach mitigates the belief space complexity, handles observation outliers, and competes with other sampling algorithms in terms of the accuracy of predicting the other agent's actions. Although the empirical results show that D-MCMC is non-divergent, in future work we will formally prove its convergence on I-POMDPs. More examples with high-dimensional state spaces can also be tested; in such settings D-MCMC should outperform I-PF, since the latter suffers long recovery times using the state transition model as the dimensionality increases.

References

Doshi, P., and Gmytrasiewicz, P. J. 2009. Monte Carlo sampling methods for approximating interactive POMDPs. Journal of Artificial Intelligence Research 34.

Doshi-Velez, F., and Konidaris, G. 2013. Hidden parameter Markov decision processes: A semiparametric regression approach for discovering latent task parametrizations. arXiv preprint.

Doshi, P., Zeng, Y., and Chen, Q. 2009. Graphical models for interactive POMDPs: representations and solutions. Autonomous Agents and Multi-Agent Systems 18(3).
Gmytrasiewicz, P. J., and Doshi, P. 2005. A framework for sequential planning in multi-agent settings. Journal of Artificial Intelligence Research 24(1).

Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. 1996. Introducing Markov chain Monte Carlo. In Markov Chain Monte Carlo in Practice.

Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. 1998. Planning and acting in partially observable stochastic domains. Artificial Intelligence 101(1).

Liu, M., Liao, X., and Carin, L. 2011. The infinite regionalized policy representation. In Proceedings of the 28th International Conference on Machine Learning (ICML-11).

Marthi, B., Pasula, H., Russell, S., and Peres, Y. 2002. Decayed MCMC filtering. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence.

Panella, A., and Gmytrasiewicz, P. 2016. Bayesian learning of other agents' finite controllers for interactive POMDPs. In Thirtieth AAAI Conference on Artificial Intelligence.

Papadimitriou, C. H., and Tsitsiklis, J. N. 1987. The complexity of Markov decision processes. Mathematics of Operations Research 12(3).

Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.

Ross, S., Chaib-draa, B., and Pineau, J. 2007. Bayes-adaptive POMDPs. In Advances in Neural Information Processing Systems.