Optimally Solving Dec-POMDPs as Continuous-State MDPs

Optimally Solving Dec-POMDPs as Continuous-State MDPs
Jilles Dibangoye (1), Chris Amato (2), Olivier Buffet (1) and François Charpillet (1)
(1) Inria, Université de Lorraine, France
(2) MIT, CSAIL, USA
IJCAI, August 8, 2013

Outline
1 Background: overview, decentralized POMDPs, existing methods
2 Dec-POMDPs as continuous-state MDPs: overview, solving the occupancy MDP, exploiting multiagent structure
3 Experiments
4 Conclusion

Background: General overview
- Agents situated in a world, receiving information and choosing actions
- Uncertainty about outcomes and sensors
- Sequential domains
- Cooperative multi-agent settings
- Decision-theoretic approach
- Developing approaches that scale to real-world domains

Background: Cooperative multiagent problems
- Each agent's choice affects all others, but must be made using only local information
- Communication may be costly, slow or noisy
- Domains of interest: robotics, disaster response, networks, ...
[Figure: example scenario showing a surveillance area, a communication area and a base area]

Background: Decentralized POMDPs
Multi-Agent Decision Making Under Uncertainty
- Decentralized partially observable Markov decision process (Dec-POMDP)
- Sequential decision-making
- At each stage, each agent takes an action and receives:
  - a local observation
  - a joint immediate reward
[Figure: n agents acting on a shared environment with actions a_1, ..., a_n, each receiving a local observation o_1, ..., o_n and a shared reward r]

Dec-POMDP definition
A Dec-POMDP is a tuple (I, S, {A_i}, {Z_i}, p, o, r, b_0, T):
- I, a finite set of agents
- S, a finite set of states
- A_i, each agent's finite set of actions
- Z_i, each agent's finite set of observations
- p, the state transition model: Pr(s' | s, a)
- o, the observation model: Pr(o | s', a)
- r, the reward model: R(s, a)
- b_0, the initial state distribution
- T, the planning horizon
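
As a point of reference only, here is a minimal sketch of how such a tuple could be stored with tabular models; the class and field names are assumptions for illustration, not the authors' code, and joint actions are flattened into a single index.

    from dataclasses import dataclass
    from typing import List, Tuple
    import numpy as np

    @dataclass
    class DecPOMDP:
        """Tabular Dec-POMDP (I, S, {A_i}, {Z_i}, p, o, r, b_0, T); illustrative encoding."""
        n_agents: int                 # |I|
        n_states: int                 # |S|
        n_actions: List[int]          # |A_i| for each agent i
        n_obs: List[int]              # |Z_i| for each agent i
        p: np.ndarray                 # p[s, a, s2] = Pr(s2 | s, a), a = joint-action index
        o: np.ndarray                 # o[a, s2, z] = Pr(z | s2, a), z = joint-observation index
        r: np.ndarray                 # r[s, a]     = R(s, a)
        b0: np.ndarray                # initial distribution over S
        horizon: int                  # planning horizon T

        def joint_action(self, actions: Tuple[int, ...]) -> int:
            """Flatten one local action per agent into a joint-action index (mixed radix)."""
            idx = 0
            for a_i, n_i in zip(actions, self.n_actions):
                idx = idx * n_i + a_i
            return idx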

Dec-POMDP solutions
- Local history: θ_i^t = (a_i^0, o_i^1, ..., a_i^{t-1}, o_i^t)
- The state is unknown, so it is beneficial to remember the history
- Local policy: each agent maps histories to actions, π_i : Θ_i → A_i
- π_i is a sequence of decision rules π_i = (π_i^0, ..., π_i^{T-1}) mapping histories to actions, π_i^t(θ_i^t) = a_i
- Joint policy π = (π_1, ..., π_n) with individual (local) agent policies π_i
- Goal: maximize the expected cumulative reward over the finite horizon
[Figure: one policy tree per agent, branching on observations o_1, o_2 and prescribing actions a_1, a_2, a_3]
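
To make the decision-rule view concrete, one illustrative encoding (an assumption, not taken from the talk) represents each agent's rule at a stage as a mapping from its own local history to a local action; decentralization means each rule sees only its own history.

    from typing import Dict, List, Tuple

    LocalHistory = Tuple[int, ...]            # (a^0, o^1, ..., a^{t-1}, o^t) for one agent
    DecisionRule = Dict[LocalHistory, int]    # local history -> local action

    def joint_action_at(rules: List[DecisionRule],
                        joint_history: Tuple[LocalHistory, ...]) -> Tuple[int, ...]:
        """Apply each agent's decision rule to its own history only (decentralized choice)."""
        return tuple(rule[h] for rule, h in zip(rules, joint_history))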

Background: POMDPs
- A POMDP is the subclass of Dec-POMDPs with only one agent
- The agent maintains a belief state b (a distribution over states)
- Policy: a mapping from histories or belief states, π : B → A
- A POMDP can be solved as a continuous-state belief MDP:
  V^π(b) = R(b, a) + Σ_o Σ_{b'} Pr(b' | b, a, o) Pr(o | b, a) V^π(b'), where a = π(b) and b' is the updated belief
- Structure: the value function is piecewise linear and convex (PWLC) over the belief simplex
[Figure: agent-environment loop (action a, observation o, reward r) and a PWLC value function plotted over beliefs between states s_1 and s_2]
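
For the single-agent case, the belief update and a PWLC value evaluation can be sketched as follows, reusing the tabular encoding assumed above (numpy arrays; not the authors' implementation).

    import numpy as np

    def belief_update(b, a, z, p, o):
        """b'(s2) ∝ Pr(z | s2, a) * Σ_s Pr(s2 | s, a) b(s); also returns Pr(z | b, a)."""
        unnormalized = o[a, :, z] * (b @ p[:, a, :])
        prob_z = unnormalized.sum()
        return (unnormalized / prob_z if prob_z > 0 else unnormalized), prob_z

    def pwlc_value(b, alpha_vectors):
        """A PWLC value function is a max over finitely many linear functions (alpha-vectors)."""
        return max(float(alpha @ b) for alpha in alpha_vectors)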

Example: 2-agent navigation (meeting in a grid)
- States: pairs of grid cells (one per agent)
- Actions: move up, down, left, right, or stay
- Transitions: noisy
- Observations: local, shown as red lines in the figure
- Rewards: negative unless the agents share the same square
[Figure: two agents in a grid world]

Challenges in solving Dec-POMDPs
- Partial observability makes the problem difficult to solve
- There is no common state estimate (centralized belief state) or other concise sufficient statistic
- Each agent's choice depends on the choices of the others
- A Dec-POMDP cannot be directly transformed into a continuous-state MDP from a single agent's perspective
- As a result, Dec-POMDPs are fundamentally different and harder (NEXP-complete rather than PSPACE-complete)

Current methods
- Assume an offline planning phase that is centralized
- Generate explicit policy representations (trees) for each agent
- Search bottom-up (dynamic programming) or top-down (heuristic search)
- Often use game-theoretic ideas from the perspective of a single agent
- Search the space of policies for the optimal set
[Figure: top-down and bottom-up construction of policy trees over actions a_1, a_2 and observations o_1, o_2]

Overview of our approach
- Current methods do not take full advantage of the centralized planning phase
- Push the common information into an occupancy state
- Move the local information into action selection, as decision rules
- Formalize Dec-POMDPs as continuous-state MDPs with a PWLC value function
- Exploit multiagent structure in the representation, making the approach scalable
- This does not use explicit policy representations or construct policies from a single agent's perspective

Centralized sufficient statistic
- Policy π: a sequence of decentralized decision rules, π = (π^0, ..., π^{T-1})
- Joint history θ^t = (θ_1^t, ..., θ_n^t), with π^t(θ^t) = (a_1, ..., a_n)
- An occupancy state is a distribution η(s, θ^t) = Pr(s, θ^t | π^{0:t-1}, b_0)
- The occupancy state is a sufficient statistic: the future policy from stage t on can be optimized over η rather than over the initial belief and the past joint policies
[Figure: per-agent policy trees over actions a_1, a_2, a_3 and observations o_1, o_2]
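
A sketch of how an occupancy state could be maintained and pushed forward one stage under a decentralized decision rule, continuing the illustrative encodings above (a sparse dictionary over (state, joint history) pairs; the helper for decoding joint observations is an assumption).

    from collections import defaultdict

    def local_obs(z, n_obs):
        """Decode a joint-observation index into one observation per agent (mixed radix)."""
        out = []
        for n in reversed(n_obs):
            out.append(z % n)
            z //= n
        return tuple(reversed(out))

    def propagate_occupancy(eta, rules, model):
        """Deterministic occupancy transition P(eta, d): eta maps (s, joint_history) to
        Pr(s, joint_history | past decision rules, b_0); d is one DecisionRule per agent."""
        eta_next = defaultdict(float)
        for (s, joint_hist), prob in eta.items():
            acts = tuple(rule[h] for rule, h in zip(rules, joint_hist))   # decentralized choice
            a = model.joint_action(acts)
            for s2 in range(model.n_states):
                p_trans = model.p[s, a, s2]
                if p_trans == 0.0:
                    continue
                for z in range(model.o.shape[2]):                         # joint observations
                    weight = prob * p_trans * model.o[a, s2, z]
                    if weight == 0.0:
                        continue
                    zs = local_obs(z, model.n_obs)
                    new_hist = tuple(h + (a_i, z_i)
                                     for h, a_i, z_i in zip(joint_hist, acts, zs))
                    eta_next[(s2, new_hist)] += weight
        return dict(eta_next)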

Dec-POMDPs as continuous-state MDPs
- Occupancy state: η_t(s, θ^t) = Pr(s, θ^t | π^{0:t-1}, η_0), with η_0 = b_0
- Transform the Dec-POMDP into a continuous-state MDP:
  - States S_MDP: occupancy states η
  - Actions A_MDP: decentralized decision rules π^t
  - Transitions T_MDP: Pr(η_t | π^{t-1}, η_{t-1}), deterministic, with P(η_t, π^t) = η_{t+1}
  - Rewards R_MDP(η_t, π^t) = Σ_{s, θ^t} η_t(s, θ^t) R(s, π^t(θ^t))
- The occupancy state is a centralized sufficient statistic
- The decision rules ensure decentralization
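
The immediate reward of the occupancy MDP is then just an expectation under the occupancy state; a short illustrative sketch in the same (assumed) encoding as above.

    def occupancy_reward(eta, rules, model):
        """R_MDP(eta, d) = sum over (s, theta) of eta(s, theta) * R(s, d(theta))."""
        total = 0.0
        for (s, joint_hist), prob in eta.items():
            acts = tuple(rule[h] for rule, h in zip(rules, joint_hist))
            total += prob * model.r[s, model.joint_action(acts)]
        return total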

Piecewise linear convexity
- Bellman optimality operator: V_t(η_t) = max_{π^t ∈ D} [ R_MDP(η_t, π^t) + V_{t+1}(P(η_t, π^t)) ]
- (1) The operator preserves the PWLC property (piecewise linearity and convexity)
- (2) R_MDP(η_t, π^t) is linear in η_t
- Hence the value function is PWLC over distributions on S × Θ^t
- POMDP algorithms can be used!
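
The linearity claim can be spelled out with a short derivation, using only the definitions from this and the previous slide (the matrix M_d below is introduced here as notation for the linear occupancy transition under a fixed decision rule; it is not from the talk).

    % For a fixed decision rule d, the immediate reward is linear in the occupancy state:
    R_{\mathrm{MDP}}(\eta_t, d)
      = \sum_{s,\theta^t} \eta_t(s,\theta^t)\, R\bigl(s, d(\theta^t)\bigr)
      = \langle \eta_t, r_d \rangle,
      \qquad r_d(s,\theta^t) := R\bigl(s, d(\theta^t)\bigr).

    % The occupancy transition is also linear for fixed d, P(\eta_t, d) = M_d\,\eta_t.
    % So if V_{t+1}(\eta) = \max_k \langle \eta, \beta_k \rangle is PWLC, then
    V_t(\eta_t)
      = \max_{d} \Bigl[\, \langle \eta_t, r_d \rangle
          + \max_{k} \langle M_d\,\eta_t, \beta_k \rangle \Bigr]
      = \max_{d,k} \,\bigl\langle \eta_t,\; r_d + M_d^{\top}\beta_k \bigr\rangle,

which is again a maximum of finitely many linear functions of η_t, i.e. PWLC.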

Solving the occupancy MDP
Feature-based heuristic search value iteration (FB-HSVI)
- Based on heuristic search value iteration (Smith and Simmons, UAI 2004)
- Sample occupancy states starting from the initial occupancy
- Update upper bounds based on decision rules (on the way down)
- Update lower bounds (on the way back up)
- Stop when the bounds converge at the initial occupancy
[Figure: lower bound υ(η) = max over hyperplanes ⟨η, β⟩, maintained as a set Λ = {β_k} (β_0, β_1, β_2 shown); upper bound ῡ(η) maintained as a point set Γ = {(η_k, v_k)}, e.g. ῡ(η) = 0.2 v_1 + 0.8 v_2 for η = 0.2 η_1 + 0.8 η_2]
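
The two bound representations in the figure can be sketched as follows (illustrative only: here η is a dense vector over (state, joint history) pairs, the upper-bound interpolation is written as a small linear program, and in practice the point set would also contain corner values so that the program is always feasible).

    import numpy as np
    from scipy.optimize import linprog

    def lower_bound(eta, Lambda):
        """Lower bound from a set of hyperplanes Lambda = [beta_k]: max_k <eta, beta_k>."""
        return max(float(beta @ eta) for beta in Lambda)

    def upper_bound(eta, Gamma):
        """Upper bound from a point set Gamma = [(eta_k, v_k)]: value of the cheapest convex
        combination of stored points reproducing eta, as in the slide's two-point example."""
        etas = np.stack([e for e, _ in Gamma], axis=1)     # columns are stored occupancy points
        vals = np.array([v for _, v in Gamma])
        res = linprog(c=vals,                              # minimize sum_k lambda_k * v_k
                      A_eq=np.vstack([etas, np.ones((1, etas.shape[1]))]),
                      b_eq=np.append(eta, 1.0),            # sum_k lambda_k * eta_k = eta, sum = 1
                      bounds=(0, None))                    # lambda_k >= 0
        return float(res.fun) if res.success else float("inf")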

Scaling up: exploiting multiagent structure
The occupancy MDP has very large action and state spaces. Two key ideas deal with these combinatorial explosions:
1 State reduction through history compression
  - Compress histories of the same length (Oliehoek et al., JAIR 2013); see the sketch below
  - Reduce the history length without loss
2 More efficient action selection
  - Generate a greedy decision rule for an occupancy state by solving a weighted constraint satisfaction problem
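
One way to read the compression criterion, following the lossless-clustering idea of Oliehoek et al. (the dictionary encoding and function name below are assumptions for illustration): two local histories of agent i can be merged when they induce the same conditional distribution over states and the other agents' histories.

    from collections import defaultdict

    def mergeable_histories(eta, agent, decimals=9):
        """Group agent i's local histories that are probabilistically equivalent under eta,
        i.e. that induce the same conditional distribution Pr(s, theta_{-i} | theta_i)."""
        joint = defaultdict(dict)
        for (s, joint_hist), prob in eta.items():
            h_i = joint_hist[agent]
            rest = tuple(h for k, h in enumerate(joint_hist) if k != agent)
            joint[h_i][(s, rest)] = joint[h_i].get((s, rest), 0.0) + prob
        clusters = defaultdict(list)
        for h_i, dist in joint.items():
            total = sum(dist.values())
            # Key on the (rounded) conditional distribution so equal conditionals collide.
            key = tuple(sorted((k, round(v / total, decimals)) for k, v in dist.items()))
            clusters[key].append(h_i)
        return [group for group in clusters.values() if len(group) > 1]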

Experiments
Tested three versions of our algorithm:
- Algorithm 0: HSVI with the occupancy MDP
- Algorithm 1: HSVI with efficient action selection
- Algorithm 2: HSVI with efficient action selection + feature-based state space
Comparison algorithms:
- Forward search: GMAA*-ICE (Spaan et al., IJCAI 2011)
- Dynamic programming: IPG (Amato et al., ICAPS 2009) and LPC (Boularias and Chaib-draa, ICAPS 2008)
- Optimization: MILP (Aras and Dutech, JAIR 2010)

Experiments: results
Optimal value υ within ε = 0.01; time and value reported on three benchmarks:
- The multi-agent tiger problem (|S| = 2, |Z| = 4, |A| = 9, K = 3)
- The recycling-robot problem (|S| = 4, |Z| = 4, |A| = 9, K = 1)
- The mars-rovers problem (|S| = 256, |Z| = 81, |A| = 36, K = 3)
[Table 2: computation times and values per horizon T, comparing MILP, LPC, IPG, GMAA*-ICE and the FB-HSVI variants on the three benchmarks]
Notes: a blank entry means the algorithm ran over time (200 s); red marks the fastest times and previously unsolvable horizons; K is the largest history window used. Only FB-HSVI reaches the longest horizons reported (e.g., T = 10 on tiger and mars-rovers, T = 100 on recycling-robot).

Conclusion
Summary
- Dec-POMDPs are powerful multiagent models
- We formulated Dec-POMDPs as continuous-state MDPs with a PWLC value function
- POMDP (and continuous-state MDP) methods can now be applied
- The approach can also take advantage of multiagent structure in the problem
- Our approach shows significantly improved scalability
Future work
- Approximate solutions (with bounds on solution quality)
- More concise statistics:
  - subclasses such as transition-independent Dec-MDPs, as in our AAMAS-13 paper
  - observation histories only, as in Oliehoek, IJCAI-13