Decision-theoretic approaches to planning, coordination and communication in multiagent systems


1 Decision-theoretic approaches to planning, coordination and communication in multiagent systems Matthijs Spaan (1), Frans Oliehoek (2), Stefan Witwicki (3). (1) Delft University of Technology, (2) U. of Liverpool & U. of Amsterdam, (3) EPFL. ESSENCE Autumn School, Ischia, Italy, October 29, 2014

2 Introduction

3 Introduction Goal in Artificial Intelligence: to build intelligent agents. Our definition of intelligent: perform an assigned task as well as possible. Problem: how to act? We will explicitly model uncertainty.

4 Motivation Intelligent distributed systems are becoming ubiquitous: Smart energy grid infrastructure Surveillance camera networks Autonomous guided vehicles, vehicular networks Internet or smart phone applications Devices can sense, compute, act and interact.

5 Motivation Intelligent distributed systems are becoming ubiquitous: Smart energy grid infrastructure Surveillance camera networks Autonomous guided vehicles, vehicular networks Internet or smart phone applications Devices can sense, compute, act and interact. Growing need for scalable and flexible multiagent planning. Key issue: uncertainty

6 Outline Before the break: 1. Introduction to decision making under uncertainty 2. Planning under action uncertainty (MDPs) 3. Planning under sensing uncertainty (POMDPs) After the break: 1. Multiagent planning (Dec-POMDPs) 2. Communication

7 Agents An agent is a (rational) decision maker who is able to perceive its external (physical) environment and act autonomously upon it (Russell and Norvig, 2003). Rationality means reaching the optimum of a performance measure. Examples: humans, robots, some software programs.

8 Agents (figure: the agent-environment loop, with the agent receiving a state observation and selecting an action) It is useful to think of agents as being involved in a perception-action loop with their environment. But how do we make the right decisions?

9 Planning Planning: a plan tells an agent how to act, for instance a sequence of actions to reach a goal, or what to do in a particular situation. We need to model: the agent's actions, its environment, and its task. We will model planning as a sequence of decisions.

10-11 Classic planning Classic planning: a sequence of actions from start to goal. Task: the robot should get to the gold as quickly as possible. Actions: movements (shown as arrows in the figure). Limitations: a new plan is needed for each start state, and the environment is assumed deterministic. Three optimal plans are shown in the figure.

12 Conditional planning Assume our robot has noisy actions (wheel slip, overshoot). We need conditional plans. Map situations to actions.

13 Decision-theoretic planning Positive reward when reaching the goal, small penalty for all other actions. The agent's plan maximizes value: the sum of future rewards. Decision-theoretic planning successfully handles noise in acting and sensing.

14-20 Decision-theoretic planning (figures: two example plans with their rewards, the values of each plan, and finally the optimal values, which encode the optimal plan)

21 Markov Decision Processes

22 Sequential decision making under uncertainty Uncertainty is abundant in real-world planning domains. Bayesian approach: probabilistic models. Main assumptions: Sequential decisions: problems are formulated as a sequence of independent decisions; Markovian environment: the state at time t depends only on the events at time t − 1; Evaluative feedback: use of a reinforcement signal as performance measure (reinforcement learning).

23 Transition model For instance, robot motion is inaccurate. Transitions between states are stochastic. p(s′|s, a) is the probability of jumping from state s to state s′ after taking action a.

24-26 MDP (figure: the agent-environment loop; the agent observes state s and reward r and selects action a according to its policy π, while the environment evolves according to the transition model p(s′|s, a) and the reward function R(s, a))

27 Optimality criterion For instance, the agent should maximize the value E[ Σ_{t=0}^{h} γ^t R_t ], (1) where h is the planning horizon (finite or ∞) and γ is a discount rate, 0 ≤ γ < 1. Reward hypothesis (Sutton and Barto, 1998): All goals and purposes can be formulated as the maximization of the cumulative sum of a received scalar signal (reward).

28 Discrete MDP model Discrete Markov Decision Process model (Puterman, 1994; Bertsekas, 2000): Time t is discrete. State space S. Set of actions A. Reward function R : S × A → R. Transition model p(s′|s, a), with T_a : S × A → Δ(S). Initial state s_0 is drawn from Δ(S). The Markov property entails that the next state s_{t+1} only depends on the previous state s_t and action a_t: p(s_{t+1} | s_t, s_{t−1}, ..., s_0, a_t, a_{t−1}, ..., a_0) = p(s_{t+1} | s_t, a_t). (2)

29 A simple problem Problem: an autonomous robot must learn how to transport material from a deposit to a building facility. (Figure: positions 1, 2, 3, with the material at position 1 and the building facility at position 3; the robot can Load/Unload and Move Left/Move Right.) (thanks to F. Melo)

30 Load/Unload as an MDP States: S = {1_U, 2_U, 3_U, 1_L, 2_L, 3_L}: 1_U robot in position 1 (unloaded); 2_U robot in position 2 (unloaded); 3_U robot in position 3 (unloaded); 1_L robot in position 1 (loaded); 2_L robot in position 2 (loaded); 3_L robot in position 3 (loaded). Actions: A = {Left, Right, Load, Unload}.

31 Load/Unload as an MDP (1) Transition probabilities: Left / Right move the robot in the corresponding direction; Load loads material (only in position 1); Unload unloads material (only in position 3). Ex: (2_L, Right) → 3_L; (3_L, Unload) → 3_U; (1_L, Unload) → 1_L. Reward: we assign a reward of +1 for every unloaded package (payment for the service).

32 Load/Unload as an MDP (2) For each action a ∈ A, T_a is a |S| × |S| matrix. Ex: T_Right is shown as a 6 × 6 matrix on the slide. Recall: S = {1_U, 2_U, 3_U, 1_L, 2_L, 3_L}.

33 Load/Unload as an MDP (3) The reward R(s, a) can also be represented as a matrix, with a single +1 entry for taking Unload in state 3_L. S = {1_U, 2_U, 3_U, 1_L, 2_L, 3_L}, A = {Left, Right, Load, Unload}
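
To make the model concrete, the following is a minimal Python/NumPy sketch of the Load/Unload MDP. It assumes deterministic movement and the +1 unloading reward described above; the exact matrix entries shown on the slides are not reproduced, so the numbers here are illustrative assumptions.

```python
import numpy as np

# States and actions of the Load/Unload example (same ordering as the slides).
states = ["1U", "2U", "3U", "1L", "2L", "3L"]
actions = ["Left", "Right", "Load", "Unload"]
S, A = len(states), len(actions)
idx = states.index

# T[a, s, s'] = p(s' | s, a); movement is assumed deterministic here.
T = np.zeros((A, S, S))
for s, name in enumerate(states):
    pos, loaded = int(name[0]), name[1] == "L"
    flag = "L" if loaded else "U"
    T[0, s, idx(f"{max(pos - 1, 1)}{flag}")] = 1.0   # Left (stays put at position 1)
    T[1, s, idx(f"{min(pos + 1, 3)}{flag}")] = 1.0   # Right (stays put at position 3)
    T[2, s, idx("1L" if pos == 1 else name)] = 1.0   # Load only works in position 1
    T[3, s, idx("3U" if pos == 3 else name)] = 1.0   # Unload only works in position 3

# R[s, a]: +1 for unloading the package at the building facility.
R = np.zeros((S, A))
R[idx("3L"), actions.index("Unload")] = 1.0
```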

34 Policies and value Policy π: tells the agent how to act. A deterministic policy π : S → A is a mapping from states to actions. Value: how much reward E[ Σ_{t=0}^h γ^t R_t ] the agent expects to gather. The value is denoted Q^π(s, a): start in s, do a and follow π afterwards.

35 Policies and value (1) Extracting a policy π from a value function Q is easy: π(s) = arg max_{a ∈ A} Q(s, a). (3) Optimal policy π*: one that maximizes E[ Σ_{t=0}^h γ^t R_t ] (for every state). In an infinite-horizon MDP there is always an optimal deterministic stationary (time-independent) policy π*. There can be many optimal policies π*, but they all share the same optimal value function Q*.

36-47 Dynamic programming Since S and A are finite, Q_n(s, a) is a matrix. The slides step through iterations of dynamic programming (γ = 0.95), filling in the entries of Q_1 and then Q_2. S = {1_U, 2_U, 3_U, 1_L, 2_L, 3_L}, A = {Left, Right, Load, Unload}

48-49 Dynamic programming Further iterations of dynamic programming (γ = 0.95): Q_5 and a later iterate are shown on the slides. S = {1_U, 2_U, 3_U, 1_L, 2_L, 3_L}, A = {Left, Right, Load, Unload}

50 Dynamic programming Final Q* and policy: π* = (Load, Left, Left, Right, Right, Unload) for the states (1_U, 2_U, 3_U, 1_L, 2_L, 3_L); the converged Q* matrix is shown on the slide.

51 Value iteration Value iteration: successive approximation technique. Start with all values set to 0. In order to consider one step deeper into the future, i.e., to compute Q_{n+1} from Q_n: Q_{n+1}(s, a) := R(s, a) + γ Σ_{s′ ∈ S} p(s′|s, a) max_{a′ ∈ A} Q_n(s′, a′), (4) which is known as the dynamic programming update or Bellman backup. Bellman (1957) equation: Q*(s, a) = R(s, a) + γ Σ_{s′ ∈ S} p(s′|s, a) max_{a′ ∈ A} Q*(s′, a′). (5)
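
As an illustration, here is a small sketch of Q-value iteration implementing the Bellman backup of Eq. (4). It assumes the model is given as the arrays T[a, s, s'] and R[s, a] from the Load/Unload sketch above.

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, eps=1e-6):
    """Q-value iteration: repeatedly apply the Bellman backup of Eq. (4)."""
    A, S, _ = T.shape
    Q = np.zeros((S, A))                   # start with all values set to 0
    while True:
        V = Q.max(axis=1)                  # V_n(s) = max_a' Q_n(s, a')
        Q_new = R + gamma * (T @ V).T      # R(s,a) + gamma sum_s' p(s'|s,a) V_n(s')
        if np.abs(Q_new - Q).max() < eps:  # largest update below threshold: converged
            return Q_new
        Q = Q_new

# Greedy policy extraction as in Eq. (3):
# Q = value_iteration(T, R)
# policy = [actions[a] for a in Q.argmax(axis=1)]
```

On the deterministic Load/Unload sketch this should recover a policy of the form shown on slide 50 (Load, Left, Left, Right, Right, Unload).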

52 Value iteration Value iteration discussion: As n → ∞, value iteration converges. Value iteration has converged when the largest update δ in an iteration is below a certain threshold ε. Exhaustive sweeps are not required for convergence, provided that in the limit all states are visited infinitely often. This can be exploited by backing up the most promising states first, known as prioritized sweeping.

53 Q-learning Reinforcement-learning techniques learn from experience; no knowledge of the model is required. Q-learning update: Q(s, a) := (1 − β) Q(s, a) + β [ r + γ max_{a′ ∈ A} Q(s′, a′) ], (6) where 0 < β ≤ 1 is a learning rate.
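
A tabular Q-learning sketch with ε-greedy exploration (discussed on the next slide); env_step is a hypothetical function, assumed to sample a reward and next state from the (unknown) environment.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_learning(Q, env_step, s, beta=0.1, gamma=0.95, epsilon=0.1, steps=1000):
    """Tabular Q-learning, Eq. (6), with epsilon-greedy action selection."""
    S, A = Q.shape
    for _ in range(steps):
        # With small probability epsilon pick a random action, otherwise act greedily.
        a = int(rng.integers(A)) if rng.random() < epsilon else int(Q[s].argmax())
        r, s_next = env_step(s, a)         # sample experience from the environment
        # Q(s,a) := (1 - beta) Q(s,a) + beta [ r + gamma max_a' Q(s',a') ]
        Q[s, a] = (1 - beta) * Q[s, a] + beta * (r + gamma * Q[s_next].max())
        s = s_next
    return Q
```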

54 Q-learning Q-learning discussion: Q-learning is guaranteed to converge to the optimal Q-values if all Q(s, a) values are updated infinitely often. In order to make sure all actions will eventually be tried in all states, exploration is necessary. A common exploration method is to execute a random action with small probability ε, which is known as ε-greedy exploration.

55 Solution methods: MDPs Model based Basic: dynamic programming (Bellman, 1957), value iteration, policy iteration. Advanced: prioritized sweeping, function approximators. Model free, reinforcement learning (Sutton and Barto, 1998) Basic: Q-learning, TD(λ), SARSA, actor-critic. Advanced: generalization in infinite state spaces, exploration/exploitation issues.

56 POMDPs

57-62 Beyond MDPs MDPs have been very successful, but they require an observable Markovian state. In many domains this is impossible (or expensive) to obtain: Diagnosis (medical, maintenance) Robot navigation Tutoring Dialog systems Vision-based robotics Fault recovery

63 Observation model Imperfect sensors. Partially observable environment: sensors are noisy, and sensors have a limited view. p(o|s, a) is the probability that the agent receives observation o in state s after taking action a.

64-67 POMDP (figure: the agent-environment loop; the agent receives observation o and reward r and selects action a according to its policy π, while the hidden state s evolves according to the transition model p(s′|s, a), the observation model p(o|s, a) and the reward function R(s, a))

68-75 POMDPs Partially observable Markov decision processes (POMDPs) (Kaelbling et al., 1998): Framework for agent planning under uncertainty. Typically assumes discrete sets of states S, actions A and observations O. Transition model p(s′|s, a): models the effect of actions. Observation model p(o|s, a): relates observations to states. Task is defined by a reward model R(s, a). A planning horizon h (finite or ∞). A discount rate 0 ≤ γ < 1. Goal is to compute a plan, or policy π, that maximizes long-term reward.

76-81 Memory In POMDPs memory is required for optimal decision making. In this non-observable example (Singh et al., 1994), with two states A and B, actions goA and goB, and rewards −r and +r, the slides compare the value obtained by: the MDP optimal policy, V = Σ_t γ^t r = r/(1 − γ); a memoryless deterministic POMDP policy; a memoryless stochastic POMDP policy (V = 0); and a memory-based (optimal) POMDP policy. The corresponding V_max and V_min expressions are derived on the slides.

82 Beliefs Beliefs: the agent maintains a belief b(s) of being at state s. After action a ∈ A and observation o ∈ O the belief b(s) can be updated using Bayes' rule: b′(s′) ∝ p(o|s′, a) Σ_s p(s′|s, a) b(s). The belief vector is a Markov signal for the planning task.
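
A minimal sketch of this belief update, assuming the models are stored as arrays T[a, s, s'] = p(s'|s, a) and O[a, s', o] = p(o|s', a) (the array names are assumptions, not from the slides); it returns both the new belief and the normalizer p(o|b, a).

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Bayes' rule: b'(s') is proportional to p(o|s',a) sum_s p(s'|s,a) b(s)."""
    predicted = b @ T[a]              # sum_s p(s'|s,a) b(s)
    unnormalized = O[a, :, o] * predicted
    p_o = unnormalized.sum()          # p(o | b, a), the normalizer
    return unnormalized / p_o, p_o
```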

83-88 Belief update example (figures: the robot's true position in a corridor with doors, and its belief over positions, updated after each observation and movement). Observations: door or corridor, observed with some noise. Action: moves 3 (20%), 4 (60%), or 5 (20%) states.

89 MDP-based algorithms Exploit the belief state, and use the MDP solution as a heuristic. Most likely state (Cassandra et al., 1996): π_MLS(b) = π*(arg max_s b(s)). Q_MDP (Littman et al., 1995): π_QMDP(b) = arg max_a Σ_s b(s) Q*(s, a). (Figure: a small example domain from Parr and Russell, 1995.)
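
Both heuristics are one-liners once an MDP Q-function is available, e.g. from the value_iteration sketch earlier; a hedged sketch:

```python
import numpy as np

def pi_mls(b, Q_mdp):
    """Most likely state: act as the MDP policy would in argmax_s b(s)."""
    return int(Q_mdp[int(b.argmax())].argmax())

def pi_qmdp(b, Q_mdp):
    """Q_MDP: argmax_a sum_s b(s) Q*(s, a)."""
    return int((b @ Q_mdp).argmax())
```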

90-93 POMDPs as continuous-state MDPs A belief-state POMDP can be treated as a continuous-state MDP: Continuous state space Δ(S): the simplex in [0, 1]^{|S|}. Stochastic Markovian transition model p(b^o_a | b, a) = p(o | b, a); this is the normalizer of Bayes' rule. Reward function R(b, a) = Σ_s R(s, a) b(s); this is the average reward with respect to b(s). The robot fully observes the new belief state b^o_a after executing a and observing o.

94-97 Solving POMDPs A solution to a POMDP is a policy, i.e., a mapping π : Δ(S) → A from beliefs to actions. The optimal value V* of a POMDP satisfies the Bellman optimality equation V* = HV*: V*(b) = max_a [ R(b, a) + γ Σ_o p(o|b, a) V*(b^o_a) ]. Value iteration repeatedly applies V_{n+1} = HV_n starting from an initial V_0. Computing the optimal value function is a hard problem (PSPACE-complete for finite horizon, undecidable for infinite horizon).

98 Example (figure: a value function over the belief space built from vectors α_1, α_2, α_3, together with the reward function R(s, a) for actions a_1, a_2, a_3 in two states)

99 PWLC shape of V_n Like V*, V_n is piecewise linear and convex as well. Rewards R(b, a) = Σ_s b(s) R(s, a) are linear functions of b. Note that the value of a point b satisfies: V_{n+1}(b) = max_a [ Σ_s b(s) R(s, a) + γ Σ_o p(o|b, a) V_n(b^o_a) ], which involves a maximization over (at least) the vectors R(·, a). Intuitively: less uncertainty about the state (low-entropy beliefs) means better decisions (thus higher value).

100 Exact value iteration Value iteration computes a sequence of value function estimates V_1, V_2, ..., V_n, using the POMDP backup operator H, V_{n+1} = HV_n. (Figure: successive value functions V_1, V_2, V_3 and V* over the belief space.)

101 Optimal value functions The optimal value function of a (finite-horizon) POMDP is piecewise linear and convex: V*(b) = max_α b · α. (Figure: a value function composed of vectors α_1, ..., α_4 over the belief space.)

102 Vector pruning (Figure: a value function with vectors α_1, ..., α_5, some of which are dominated, and witness beliefs b_1, b_2.) Linear program for pruning: variables: b(s) for all s ∈ S, and x; maximize: x; subject to: b · (α − α′) ≥ x for all α′ ∈ V, α′ ≠ α, and b ∈ Δ(S).
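
As an illustration, this pruning LP can be written down with scipy.optimize.linprog; the following is a sketch (the variable layout and function name are my own), where `alpha` is kept if the optimal x is positive and can be pruned otherwise.

```python
import numpy as np
from scipy.optimize import linprog

def witness_lp(alpha, others):
    """Maximize x s.t. b.(alpha - alpha') >= x for all alpha' != alpha, b in the simplex."""
    alpha, others = np.asarray(alpha, float), np.asarray(others, float)
    S = len(alpha)
    c = np.zeros(S + 1)
    c[-1] = -1.0                                    # maximize x == minimize -x
    A_ub = np.hstack([-(alpha - others), np.ones((len(others), 1))])
    b_ub = np.zeros(len(others))                    # encodes x - b.(alpha - alpha') <= 0
    A_eq = np.hstack([np.ones((1, S)), np.zeros((1, 1))])
    b_eq = np.array([1.0])                          # b sums to one
    bounds = [(0.0, 1.0)] * S + [(None, None)]      # b(s) in [0,1], x free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:S], res.x[-1]                     # witness belief b and value x
```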

103 Optimal POMDP methods Enumerate and prune: Most straightforward: Monahan's (1982) enumeration algorithm. Generates a maximum of |A| |V_n|^{|O|} vectors at each iteration, hence requires pruning. Incremental pruning (Zhang and Liu, 1996; Cassandra et al., 1997). Search for witness points: One Pass (Sondik, 1971; Smallwood and Sondik, 1973). Relaxed Region, Linear Support (Cheng, 1988). Witness (Cassandra et al., 1994).

104 Sub-optimal techniques Grid-based approximations (Drake, 1962; Lovejoy, 1991; Brafman, 1997; Zhou and Hansen, 2001; Bonet, 2002). Optimizing finite-state controllers (Platzman, 1981; Hansen, 1998b; Poupart and Boutilier, 2004). Heuristic search in the belief tree (Satia and Lave, 1973; Hansen, 1998a). Compression or clustering (Roy et al., 2005; Poupart and Boutilier, 2003; Virin et al., 2007). Point-based techniques (Pineau et al., 2003; Smith and Simmons, 2004; Spaan and Vlassis, 2005; Shani et al., 2007; Kurniawati et al., 2008). Monte Carlo tree search (Silver and Veness, 2010).

105-106 Point-based backup For a finite horizon V* is piecewise linear and convex, and for infinite horizons V* can be approximated arbitrarily well by a PWLC value function (Smallwood and Sondik, 1973). Given a value function V_n and a particular belief point b we can easily compute the vector α^b_{n+1} of HV_n such that α^b_{n+1} = arg max_{{α^k_{n+1}}_k} b · α^k_{n+1}, where {α^k_{n+1}}_{k=1}^{|HV_n|} is the (unknown) set of vectors for HV_n. We will denote this operation α^b_{n+1} = backup(b).
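
The backup(b) operation can be implemented directly from the model, without enumerating HV_n; a sketch, reusing the T[a, s, s'], O[a, s', o] and R[s, a] array conventions assumed earlier:

```python
import numpy as np

def backup(b, V, T, O, R, gamma=0.95):
    """Point-based backup: the vector of HV_n that is maximizing at belief b.

    V is an array of alpha-vectors with shape (K, S); returns (alpha_b, action).
    """
    A, S, _ = T.shape
    best_val, best_vec, best_act = -np.inf, None, None
    for a in range(A):
        g_a = R[:, a].astype(float)
        for o in range(O.shape[2]):
            # g_k(s) = sum_s' p(o|s',a) p(s'|s,a) alpha_k(s') for every alpha_k in V
            g_ao = (T[a] * O[a, :, o][None, :]) @ V.T    # shape (S, K)
            k = int((b @ g_ao).argmax())                 # best vector for this (a, o) at b
            g_a = g_a + gamma * g_ao[:, k]
        if b @ g_a > best_val:
            best_val, best_vec, best_act = float(b @ g_a), g_a, a
    return best_vec, best_act
```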

107 Point-based (approximate) methods Point-based (approximate) value iteration plans only on a limited set of reachable belief points: 1. Let the robot explore the environment. 2. Collect a set B of belief points. 3. Run approximate value iteration on B.

108-119 PERSEUS: randomized point-based VI Idea: at every backup stage improve the value of all b ∈ B. (Figures: successive value functions V_1, V_2, V_3, V_4 over a set of belief points b_1, ..., b_7; in each stage only a subset of the points is backed up, yet the value of all points improves.) (Spaan and Vlassis, 2005)
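
A sketch of one PERSEUS backup stage, under the same array conventions and using the backup(b) sketch above; only a randomly selected subset of B is backed up, yet the value of every b in B does not decrease.

```python
import numpy as np

def perseus_stage(B, V, T, O, R, gamma=0.95, rng=np.random.default_rng(0)):
    """One backup stage of PERSEUS-style randomized point-based value iteration."""
    old_value = [max(float(b @ alpha) for alpha in V) for b in B]
    todo, V_new = set(range(len(B))), []
    while todo:
        i = int(rng.choice(list(todo)))                   # pick a not-yet-improved point
        alpha_b, _ = backup(B[i], V, T, O, R, gamma)
        if float(B[i] @ alpha_b) < old_value[i]:          # keep the old best vector instead
            alpha_b = max(V, key=lambda alpha: float(B[i] @ alpha))
        V_new.append(alpha_b)
        # points whose value is already improved by V_new need no further backup
        todo = {j for j in todo
                if max(float(B[j] @ alpha) for alpha in V_new) < old_value[j]}
    return np.array(V_new)
```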

120 Sub-optimal techniques Grid-based approximations (Drake, 1962; Lovejoy, 1991; Brafman, 1997; Zhou and Hansen, 2001; Bonet, 2002). Optimizing finite-state controllers (Platzman, 1981; Hansen, 1998b; Poupart and Boutilier, 2004). Heuristic search in the belief tree (Satia and Lave, 1973; Hansen, 1998a). Compression or clustering (Roy et al., 2005; Poupart and Boutilier, 2003; Virin et al., 2007). Point-based techniques (Pineau et al., 2003; Smith and Simmons, 2004; Spaan and Vlassis, 2005; Shani et al., 2007; Kurniawati et al., 2008). Monte Carlo tree search (Silver and Veness, 2010).

121 Multiagent Planning

122 Multiagent Systems (MASs) Why MASs? If we can make intelligent agents, soon there will be many... Physically distributed systems: centralized solutions are expensive and brittle. MASs can potentially provide [Vlassis, 2007; Sycara, 1998]: Speedup and efficiency. Robustness and reliability ('graceful degradation'). Scalability and flexibility (adding additional agents).

123-126 Example: Predator-Prey Domain Predator-Prey domain, still single agent! Agent: the predator (blue); the prey (red) is part of the environment; on a torus ('wrap around world'). This is a Markov decision process (MDP): the Markovian environment state s is observed, a policy π maps states to actions, there is a value function Q(s, a), and value iteration is a way to compute it. Formalization: states, e.g., (-3,4); actions N, W, S, E; transitions: probability of failing to move, prey moves; rewards: reward for capturing.

127-131 Partial Observability Now: partial observability, e.g., a limited range of sight. This gives a Partially Observable MDP (POMDP): MDP + observations, with explicit observations, observation probabilities and noisy observations (detection probability), e.g., o = 'nothing' when the prey is out of sight, or its observed relative position otherwise. The agent cannot observe the state, so it needs to maintain a belief over states b(s), and a policy maps beliefs to actions, π(b) = a. Reduction: a continuous-state MDP (in which the belief is the state); value iteration makes use of α-vectors (which correspond to complete policies) and performs pruning to eliminate dominated α's.

132-136 Multiple Agents Now: multiple agents, fully observable. This is a Multiagent MDP [Boutilier 1996]. Differences with an MDP: n agents; joint actions a = ⟨a_1, a_2, ..., a_n⟩; transitions and rewards depend on joint actions. Formalization: states, e.g., ((3,-4), (0,0), (-2,0)); actions {N, W, S, E}; joint actions {(N,N,N), (N,N,W), ..., (E,E,E)}; transitions: probability of failing to move, prey moves; rewards: reward for capturing jointly. Solution: treat it as a normal MDP with a 'puppeteer agent'; the optimal policy π(s) = a, and every agent executes its part. Catch: the number of joint actions is exponential! (But other than that, conceptually simple.)

137-140 Multiple Agents & Partial Observability Now: both partial observability and multiple agents. This gives Decentralized POMDPs (Dec-POMDPs) [Bernstein et al. 2002], with both joint actions and joint observations. Again we can make a reduction: Dec-POMDP → MPOMDP (multiagent POMDP), with a 'puppeteer' agent that receives joint observations and takes joint actions. This requires broadcasting observations! With instantaneous, cost-free, noise-free communication it is optimal [Pynadath and Tambe 2002]. Without such communication: no easy reduction.

141 The Dec-POMDP Model

142 Acting Based On Local Observations MPOMDP: act on global information. This can be impractical: communication may not be possible; it may have significant cost (e.g., battery power); it may not be instantaneous or noise free; and it scales poorly with the number of agents! Alternative: act based only on local observations. Other side of the spectrum: no communication at all. (There are also intermediate approaches: delayed communication, stochastic delays.)

143 Formal Model A Dec-POMDP is a tuple ⟨S, A, P_T, O, P_O, R, h⟩ with n agents: S set of states; A set of joint actions, a = ⟨a_1, a_2, ..., a_n⟩; P_T transition function P(s′ | s, a); O set of joint observations, o = ⟨o_1, o_2, ..., o_n⟩; P_O observation function P(o | a, s′); R reward function R(s, a); h horizon (finite).
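
As a data structure, the Dec-POMDP tuple can be sketched as follows; this is a hedged illustration, with field names and array layouts of my own choosing rather than a standard library API.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple
import numpy as np

@dataclass
class DecPOMDP:
    """Container for the tuple <S, A, P_T, O, P_O, R, h> with n agents."""
    n_agents: int
    states: Sequence[str]
    joint_actions: Sequence[Tuple[int, ...]]       # all tuples <a_1, ..., a_n>
    joint_observations: Sequence[Tuple[int, ...]]  # all tuples <o_1, ..., o_n>
    P_T: np.ndarray                                # P_T[a, s, s'] = P(s' | s, a)
    P_O: np.ndarray                                # P_O[a, s', o] = P(o | a, s')
    R: np.ndarray                                  # R[s, a]
    h: int                                         # finite horizon
```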

144-147 Running Example The 2 generals problem: a small army against a large army. S = {s_L, s_S}; A_i = {(O)bserve, (A)ttack}; O_i = {(L)arge, (S)mall}. Transitions: if both generals Observe there is no state change; if at least one Attacks, the state resets with 5% probability. Observations: the probability of a correct observation is 0.85, e.g., P(⟨L, L⟩ | s_L) = 0.85 * 0.85 = 0.7225. Rewards: if one general attacks alone, he loses the battle (a negative reward R(·, ⟨A,O⟩)); if both generals Observe there is a small cost R(·, ⟨O,O⟩); if both Attack, the reward depends on the state: R(s_L, ⟨A,A⟩) = −2, R(s_S, ⟨A,A⟩) = +5. Now suppose h = 3: what do you think is optimal in this problem?

148 Off-line / On-line phases Off-line planning, on-line execution; the execution is decentralized: Planning Phase → Execution Phase, π = ⟨π_1, π_2⟩.

149-150 Policy Domain What do policies look like? In general, they map histories to actions. Before, we had more compact representations (states, beliefs)... Now this is difficult: no such representation is known! So we will be stuck with histories. Most general are action-observation histories (AOHs): θ_i^t = (a_i^0, o_i^1, a_i^1, ..., a_i^{t−1}, o_i^t). But we can restrict ourselves to deterministic policies, which only need observation histories (OHs): ō_i^t = (o_i^1, ..., o_i^t).

151-153 No Compact Representation? There are a number of types of beliefs considered. Joint belief b(s) (as in an MPOMDP) [Pynadath and Tambe 2002]: compute b(s) using joint actions and observations. Problem: agents do not know those during execution. Multiagent belief b_i(s, q_{-i}) [Hansen et al. 2004]: a belief over states and the (future) policies of the other agents. We need to be able to predict the other agents: for the belief update P(s′ | s, a_i, a_{-i}) and P(o | a_i, a_{-i}, s′), and for the prediction of R(s, a_i, a_{-i}). What is the form of those other policies? Most general: π_j : ō_j → a_j. What if they use beliefs? An infinite recursion of beliefs!

154-156 Goal of Planning Find the optimal joint policy π* = ⟨π_1, π_2⟩, where individual policies map OHs to actions, π_i : Ō_i → A_i. What is the optimal one? Define the value as the expected sum of rewards: V(π) = E[ Σ_{t=0}^{h−1} R(s, a) | π, b_0 ]. An optimal joint policy is one with maximal value (there can be more than one that achieves this). Optimal policy for 2 generals, h = 3 (identical for both agents): () → observe; (o_small) → observe; (o_large) → observe; (o_small, o_small) → attack; (o_small, o_large) → attack; (o_large, o_small) → attack; (o_large, o_large) → observe. Conceptually: what should a policy optimize to allow for good coordination (and thus high value)?

157 Coordination vs. Exploitation of Local Information There is an inherent trade-off between coordination and the exploitation of local information. Ignore your own observations, an 'open loop plan' (e.g., ATTACK on the 2nd time step): + maximally predictable, − low quality. Ignore coordination, e.g., compute an individual belief b_i(s) = Σ_{q_{-i}} b(s, q_{-i}) and execute the MPOMDP policy: + uses local information, − likely to result in mis-coordination. The optimal policy π* should balance between these.

158 Planning Methods

159-161 Brute Force Search We can compute the value of a joint policy V(π) using a Bellman-like equation [Oliehoek 2012]. So the stupidest algorithm is: compute V(π) for all π, and select a π with maximum value. The number of joint policies is huge, doubly exponential in the horizon h (e.g., about 3.352e+153 joint policies for h = 8). Clearly intractable... And there is no easy way out: the problem is NEXP-complete [Bernstein et al. 2002], so most likely (assuming EXP != NEXP) doubly exponential time is required. Still, there are better algorithms that work better for at least some problems, and trying to compute the optimum is useful to understand what optimal really means.
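
A sketch of the evaluation step of this brute-force scheme: computing V(π) for one deterministic joint policy by taking the expectation over states and joint observations (the enumeration over all joint policies, which is the doubly exponential part, is left out). It assumes the DecPOMDP container sketched earlier and that joint_policy[i] maps an individual observation history (a tuple) to an individual action index.

```python
def policy_value(model, joint_policy, b0):
    """Expected sum of rewards of a deterministic joint policy over horizon h."""
    n, n_states = model.n_agents, len(model.states)

    def V(t, s, histories):
        if t == model.h:
            return 0.0
        acts = tuple(joint_policy[i][histories[i]] for i in range(n))
        a = model.joint_actions.index(acts)
        value = float(model.R[s, a])
        for s2 in range(n_states):
            for o, joint_obs in enumerate(model.joint_observations):
                p = model.P_T[a, s, s2] * model.P_O[a, s2, o]
                if p > 0:
                    hist2 = tuple(histories[i] + (joint_obs[i],) for i in range(n))
                    value += p * V(t + 1, s2, hist2)
        return value

    empty = tuple(() for _ in range(n))
    return sum(b0[s] * V(0, s, empty) for s in range(n_states))
```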

162 Algorithmic Developments Dynamic Programming: DP for POSGs/Dec-POMDPs [Hansen et al. 2004]. Improvements to the exhaustive backup [Amato et al. 2009]. Compression of values (LPC) [Boularias & Chaib-draa 2008]. (Point-based) Memory-Bounded DP [Seuken & Zilberstein 2007a]. Improvements to the PB backup [Seuken & Zilberstein 2007b; Carlin and Zilberstein 2008; Dibangoye et al. 2009; Amato et al. 2009; Wu et al. 2010, etc.]. Heuristic Search: No backtracking: just the most promising path [Emery-Montemerlo et al. 2004; Oliehoek et al. 2008]. Clustering of histories: reduce the number of child nodes [Oliehoek et al. 2009]. Incremental expansion: avoid expanding all child nodes [Spaan et al. 2011; Oliehoek et al. 2013]. MILP [Aras and Dutech 2010].

163 State of The Art To get an impression... Optimal solutions: improvements of MAA* lead to significant increases, but they are problem dependent. (Tables: runtimes in seconds of MILP, LPC and GMAA*-ICE for increasing horizon h on the Dec-Tiger and BroadcastChannel benchmarks; * excluding the heuristic.) Approximate (no quality guarantees): MBDP: linear in the horizon [Seuken & Zilberstein 2007a]. Rollout sampling extension: up to 20 agents [Wu et al. 2010b]. Transfer planning: use smaller problems to solve large (structured) problems with many agents [Oliehoek 2010].

164 Communication

165-167 Single-agent POMDPs (figure: a two-time-slice diagram for times t and t + 1: the state s_t generates observation o_t, the agent takes action a, receives reward r, and the state transitions to s_{t+1})

168-170 Dec-POMDPs (figure: the same two-time-slice diagram, now with individual observations o_i, o_j and individual actions a_i, a_j for the two agents, a shared reward r, and a state transition from s_t to s_{t+1})

171-172 POMDPs vs. Dec-POMDPs Both are frameworks for acting optimally given limited sensing and stochastic environments. Dec-POMDPs: decentralized execution; usually centralized, off-line planning; no common state estimate, no joint belief; optimal policies are based on individual observation histories; NEXP-complete, doubly exponential in the horizon.

173 Adding communication Implicit communication in Dec-POMDPs: agent actions affect the state, which affects other agents' observations. Explicit communication can be added as well; this is equivalent to a Dec-POMDP (Goldman and Zilberstein, 2004). Issues: Information sharing can reduce complexity. What to communicate and to whom? When to coordinate? Semantics are predominantly predefined.

174 Communication and interdependence (figure: models arranged along two axes, communication (none to perfect) and interdependence (low to high): independent POMDPs, Dec-POMDP, Dec-POMDP with comms, auctioned POMDPs, multiagent POMDP)

175-178 Communication Outline 1. Optimizing semantics (Spaan et al., 2006): learning the meaning of messages. 2. Multiagent POMDPs with delayed communication (Spaan et al., 2008; Oliehoek and Spaan, 2012a): planning with stochastically delayed information sharing. 3. Offline communication policies for factored multiagent POMDPs (Messias et al., 2011): planning when communication can be reduced. 4. Event-driven models (Messias et al., 2013): planning for asynchronous execution.

179 Optimizing semantics Learning the meaning of messages

180 Example heaven or hell Task: meet in heaven. The priest knows the location of heaven. (Figure: a grid world with two candidate heaven/hell locations and the priest's location.)

181-183 Proposed model (Spaan et al., 2006) Features: Decentralized: each agent considers only its own local state plus some uncontrollable state features shared by all agents; this avoids computing an (approximate) solution of the centralized planning problem. Communication: the semantics of sending a particular message are part of the optimization problem; communication is an integral part of an agent's reasoning, not an add-on. Environment: the agents inhabit a stochastic environment that is only partially observable to them.

184 Proposed model: discussion Agent i's policy π_i maps a belief over its local state (s_0, s_i) to an action (as in a POMDP). A message sent by an agent i as part of its action is received by agent j at the next timestep as part of its observation. Agent j knows π_i, so a message conveys information about the shared state features. This information is represented in agent i's observation model.

185-187 Communication example (figures: the beliefs of agent 1 and agent 2 over the heaven-left / heaven-right hypotheses as they evolve). Agent 1 has just moved to the priest's location. Agent 1 observes heaven-left and executes south while sending a message. Agent 2 observes no-walls together with the received message, and executes west.

188 POMDP model for a single agent Fixed policies of other agents can be treated as part of the environment. Agent j's policy π_j influences agent i's POMDP model: the observation model through communication, and the reward function through the joint reward. Constructing a POMDP model for i requires the statistics p(s_j) and p(a_j | s_j), which are approximated by simulating {π_i, π_j}.

189 Delayed communication Planning with stochastically delayed information sharing

190 Multiagent POMDPs with delayed communication (figure: the communication/interdependence diagram, with the MPOMDP with delayed communication placed between the Dec-POMDP and the multiagent POMDP)

191 Instantaneous communication (figure: timeline of one decision interval for robots i and j: in state s_t they receive observations o_i, o_j, exchange them, take actions a_i, a_j, and the state transitions to s_{t+1})

192 Instantaneous communication Instantaneous, perfect communication reduces a Dec-POMDP to a POMDP (Pynadath and Tambe, 2002). Observation sharing: each agent knows the joint belief b_t at time t. The value function is piecewise linear and convex (PWLC). But this synchronization step takes time. (Figure: timeline in which the joint belief b is common knowledge before the joint action is taken.)

193 Delayed communication But what if communication is not instantaneous? Delayed communication is useful in several settings: smart protection schemes, multi-robot systems, distributed video surveillance. It also provides an upper bound on the Dec-POMDP value function (Oliehoek et al., 2008).

194 Delayed communication (figure: timeline of a decision interval for agents i and j in which each agent's observation reaches the other agent only in the next decision interval)

195 Fixed delay communication Each agent knows the last commonly known belief b_{t−k}, k > 0. If k = 1, the optimal value function is PWLC (Hsu and Marcus, 1982); it is equal to the Q_BG value function (Oliehoek et al., 2008). Otherwise the value function is not separable (Varaiya and Walrand, 1978): it is not a function over the joint belief space. But what if the delay can vary?

196 Stochastic communication delays (Spaan et al., 2008) Probability that synchronization succeeds within a particular stage i: p_{iTD}(s). Optimal value function: Q_SD = R + p_{0TD} F_{0TD} + p_{1TD} F_{1TD} + p_{2TD} F_{2TD} + ..., where F_{iTD} is the expected future reward given delay i. We assume during planning that the delay is at most 1 step: p_{1D} = p_{1TD} + p_{2TD} + ... = 1 − p_{0TD}, and define an approximate value function as Q̃_SD = R + p_{0TD} F_{0TD} + p_{1D} F_{1TD}. We prove that Q̃_SD is PWLC over the joint belief space.

197 Algorithms Algorithms: Point-based approximate POMDP techniques transfer (Spaan et al., 2008). Exhaustive backups of the one-step-delay value function can be sped up by tree-based pruning (Oliehoek and Spaan, 2012b). For k > 1, we propose an online algorithm similar to Dec-COMM (Roth et al., 2005). Simulation results: the policies consider potential future communication capabilities.

198 Experimental results Meet in corner (figure: a grid with start S and goal G, with a clockwise (CW) and a counter-clockwise (CCW) route). The CW route is cheaper, and with uniform p_{0TD} it is chosen. We set p_{0TD}(s) = 0 for s on the CW route and to 1 everywhere else. Now the agents choose the CCW route, leading to higher payoff.

199 Experimental results Policies computed for a specific value of p_{0TD}, but evaluated under different values of p_{0TD}. (Figure: value Ṽ(b_0) in the large meet-in-corner problem as a function of p_{0TD}, for policies computed with three different values of p_{0TD}.)

200 Offline communication policies Planning when communication can be reduced

201-203 Exploiting local interactions Real-world domains have structure that can be exploited: factored models. Localized interactions can lead to reduced communication.

204 Factored MPOMDPs (figure: the communication/interdependence diagram, with factored MPOMDPs placed near the multiagent POMDP)

205 Relay example State space: S = X_1 × X_2, where X_1 = {l_1, l_2} and X_2 = {r_1, r_2}. Actions: A_1 = A_2 = {shuffle, exchange, sense/noop}. Observations: O_1 = O_2 = {door, wall, idle}. The agents should exchange when both are near the doorway; unsuccessful exchanges are penalized. (Figure: two rooms connected by a doorway, with local states l_1, l_2 and r_1, r_2 and the actions shuffle, exchange, noop.)

206 Factored Multiagent POMDPs In multiagent POMDPs, joint beliefs map to joint actions. It is advantageous to factor the belief state itself (Messias et al., 2011): Agents maintain belief states over locally relevant factors. The joint belief is approximated in factorized form. Multiagent POMDP policies are mapped to individual belief spaces. Query communication is used when other agents' factors are required. Extends ideas of Roth et al. (2007) to the MPOMDP setting.

207 Linear and local supports of the value function (figure: the linear supports of the value function, i.e. the vectors for shuffle, exchange and sense/noop, and their local supports over the agent's local belief in the relay domain)

208 Event-driven models Planning for asynchronous execution

209 Discrete Multiagent Markov Decision Processes Often multiagent decision-theoretic models are discrete: discrete state space S, discrete action space A. Policies are step-based and abstract w.r.t. time: π(s) = {δ_0(s), ..., δ_h(s)}. Communication is required if only local states can be observed. (Figure: two agents acting on a discrete world state with transition model Pr(s′|s, a) and reward R(s, a).)

210 In the Real World... (figure: several copies of the discrete-model diagram from the previous slide, each with its own world state Pr(s′|s, a), agents and reward R(s, a))

211-212 Real-Time Execution Strategies Synchronous execution: all agents select actions at fixed intervals T, 2T, 3T, 4T, while the state evolves s_0, s_1, s_2, s_3 (figure: both agents execute a(0), a(1), a(2), a(3) in lockstep). Event-driven (asynchronous) execution: agents select actions at the event times t_{e1}, t_{e2}, t_{e3} at which the state changes.

213 Semi-Markov Decision Processes SMDP: extension of a discrete MDP with temporal distributions f(s, a, s′): p(t, s′ | s, a) = p(t | s, a, s′) Pr(s′ | s, a) = f(s, a, s′) T(s, a, s′). (Figure: the world state with sojourn-time distribution p(t|s, a, s′), cost c(s, a, s′) and reward R(s, a).)
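
A small sketch of sampling one SMDP transition from p(t, s′|s, a) = f(s, a, s′) T(s, a, s′); the sojourn-time sampler is an assumed, user-provided function (illustrated here with an exponential distribution, which is purely an assumption).

```python
import numpy as np

rng = np.random.default_rng(0)

def smdp_step(s, a, T, sample_sojourn):
    """Sample (t, s') from p(t, s' | s, a) = f(s, a, s') T(s, a, s')."""
    s_next = int(rng.choice(len(T[a, s]), p=T[a, s]))   # embedded Markov chain step
    t = sample_sojourn(s, a, s_next)                    # t ~ f(s, a, s')
    return t, s_next

# Example usage with exponentially distributed sojourn times (an assumption):
# t, s_next = smdp_step(s, a, T, lambda s, a, s2: rng.exponential(1.0))
```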

214 Generalized Semi-Markov Decision Processes Generalized SMDPs (Younes and Simmons, 2004): state transitions are abstracted as events e ∈ E, and transition probabilities also depend on the history of events. We apply a GSMDP-based solution to a team of real robots (Messias et al., 2013), extended to partially observable domains. (Figure: from a state, events lead to successor states s_2, s_3 with probabilities Pr(s_2|s, a), Pr(s_3|s, a) and time distributions p(t|s_2, e_n), p(t|s_3, e_n).)

215 Experimental Setup A cooperative decision-making problem for two robots, implemented both as a GSMDP and as a synchronous MDP with a configurable decision rate (|S| = 26, |A| = 36). Results both with realistic physics-based simulation and with real robots.

216 Robot Results - Video


More information

An Adaptive Clustering Method for Model-free Reinforcement Learning

An Adaptive Clustering Method for Model-free Reinforcement Learning An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at

More information

Efficient Maximization in Solving POMDPs

Efficient Maximization in Solving POMDPs Efficient Maximization in Solving POMDPs Zhengzhu Feng Computer Science Department University of Massachusetts Amherst, MA 01003 fengzz@cs.umass.edu Shlomo Zilberstein Computer Science Department University

More information

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)

More information

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam:

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam: Marks } Assignment 1: Should be out this weekend } All are marked, I m trying to tally them and perhaps add bonus points } Mid-term: Before the last lecture } Mid-term deferred exam: } This Saturday, 9am-10.30am,

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Reinforcement learning Daniel Hennes 4.12.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Reinforcement learning Model based and

More information

Kalman Based Temporal Difference Neural Network for Policy Generation under Uncertainty (KBTDNN)

Kalman Based Temporal Difference Neural Network for Policy Generation under Uncertainty (KBTDNN) Kalman Based Temporal Difference Neural Network for Policy Generation under Uncertainty (KBTDNN) Alp Sardag and H.Levent Akin Bogazici University Department of Computer Engineering 34342 Bebek, Istanbul,

More information

Reinforcement Learning and Control

Reinforcement Learning and Control CS9 Lecture notes Andrew Ng Part XIII Reinforcement Learning and Control We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make

More information

Information Gathering and Reward Exploitation of Subgoals for P

Information Gathering and Reward Exploitation of Subgoals for P Information Gathering and Reward Exploitation of Subgoals for POMDPs Hang Ma and Joelle Pineau McGill University AAAI January 27, 2015 http://www.cs.washington.edu/ai/mobile_robotics/mcl/animations/global-floor.gif

More information

University of Alberta

University of Alberta University of Alberta NEW REPRESENTATIONS AND APPROXIMATIONS FOR SEQUENTIAL DECISION MAKING UNDER UNCERTAINTY by Tao Wang A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment

More information

Reinforcement Learning II

Reinforcement Learning II Reinforcement Learning II Andrea Bonarini Artificial Intelligence and Robotics Lab Department of Electronics and Information Politecnico di Milano E-mail: bonarini@elet.polimi.it URL:http://www.dei.polimi.it/people/bonarini

More information

Producing Efficient Error-bounded Solutions for Transition Independent Decentralized MDPs

Producing Efficient Error-bounded Solutions for Transition Independent Decentralized MDPs Producing Efficient Error-bounded Solutions for Transition Independent Decentralized MDPs Jilles S. Dibangoye INRIA Loria, Campus Scientifique Vandœuvre-lès-Nancy, France jilles.dibangoye@inria.fr Christopher

More information

Artificial Intelligence & Sequential Decision Problems

Artificial Intelligence & Sequential Decision Problems Artificial Intelligence & Sequential Decision Problems (CIV6540 - Machine Learning for Civil Engineers) Professor: James-A. Goulet Département des génies civil, géologique et des mines Chapter 15 Goulet

More information

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Artificial Intelligence Review manuscript No. (will be inserted by the editor) Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Mostafa D. Awheda Howard M. Schwartz Received:

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Formal models of interaction Daniel Hennes 27.11.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Taxonomy of domains Models of

More information

European Workshop on Reinforcement Learning A POMDP Tutorial. Joelle Pineau. McGill University

European Workshop on Reinforcement Learning A POMDP Tutorial. Joelle Pineau. McGill University European Workshop on Reinforcement Learning 2013 A POMDP Tutorial Joelle Pineau McGill University (With many slides & pictures from Mauricio Araya-Lopez and others.) August 2013 Sequential decision-making

More information

, and rewards and transition matrices as shown below:

, and rewards and transition matrices as shown below: CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount

More information

Reinforcement Learning. George Konidaris

Reinforcement Learning. George Konidaris Reinforcement Learning George Konidaris gdk@cs.brown.edu Fall 2017 Machine Learning Subfield of AI concerned with learning from data. Broadly, using: Experience To Improve Performance On Some Task (Tom

More information

Chapter 16 Planning Based on Markov Decision Processes

Chapter 16 Planning Based on Markov Decision Processes Lecture slides for Automated Planning: Theory and Practice Chapter 16 Planning Based on Markov Decision Processes Dana S. Nau University of Maryland 12:48 PM February 29, 2012 1 Motivation c a b Until

More information

Tree-based Solution Methods for Multiagent POMDPs with Delayed Communication

Tree-based Solution Methods for Multiagent POMDPs with Delayed Communication Tree-based Solution Methods for Multiagent POMDPs with Delayed Communication Frans A. Oliehoek MIT CSAIL / Maastricht University Maastricht, The Netherlands frans.oliehoek@maastrichtuniversity.nl Matthijs

More information

Prioritized Sweeping Converges to the Optimal Value Function

Prioritized Sweeping Converges to the Optimal Value Function Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science

More information

Internet Monetization

Internet Monetization Internet Monetization March May, 2013 Discrete time Finite A decision process (MDP) is reward process with decisions. It models an environment in which all states are and time is divided into stages. Definition

More information

Basics of reinforcement learning

Basics of reinforcement learning Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system

More information

Heuristic Search Value Iteration for POMDPs

Heuristic Search Value Iteration for POMDPs 520 SMITH & SIMMONS UAI 2004 Heuristic Search Value Iteration for POMDPs Trey Smith and Reid Simmons Robotics Institute, Carnegie Mellon University {trey,reids}@ri.cmu.edu Abstract We present a novel POMDP

More information

A Decentralized Approach to Multi-agent Planning in the Presence of Constraints and Uncertainty

A Decentralized Approach to Multi-agent Planning in the Presence of Constraints and Uncertainty 2011 IEEE International Conference on Robotics and Automation Shanghai International Conference Center May 9-13, 2011, Shanghai, China A Decentralized Approach to Multi-agent Planning in the Presence of

More information

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 12: Reinforcement Learning Teacher: Gianni A. Di Caro REINFORCEMENT LEARNING Transition Model? State Action Reward model? Agent Goal: Maximize expected sum of future rewards 2 MDP PLANNING

More information

Minimizing Communication Cost in a Distributed Bayesian Network Using a Decentralized MDP

Minimizing Communication Cost in a Distributed Bayesian Network Using a Decentralized MDP Minimizing Communication Cost in a Distributed Bayesian Network Using a Decentralized MDP Jiaying Shen Department of Computer Science University of Massachusetts Amherst, MA 0003-460, USA jyshen@cs.umass.edu

More information

IAS. Dec-POMDPs as Non-Observable MDPs. Frans A. Oliehoek 1 and Christopher Amato 2

IAS. Dec-POMDPs as Non-Observable MDPs. Frans A. Oliehoek 1 and Christopher Amato 2 Universiteit van Amsterdam IAS technical report IAS-UVA-14-01 Dec-POMDPs as Non-Observable MDPs Frans A. Oliehoek 1 and Christopher Amato 2 1 Intelligent Systems Laboratory Amsterdam, University of Amsterdam.

More information

Region-Based Dynamic Programming for Partially Observable Markov Decision Processes

Region-Based Dynamic Programming for Partially Observable Markov Decision Processes Region-Based Dynamic Programming for Partially Observable Markov Decision Processes Zhengzhu Feng Department of Computer Science University of Massachusetts Amherst, MA 01003 fengzz@cs.umass.edu Abstract

More information

Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm

Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 2778 mgl@cs.duke.edu

More information

Decentralized Control of Cooperative Systems: Categorization and Complexity Analysis

Decentralized Control of Cooperative Systems: Categorization and Complexity Analysis Journal of Artificial Intelligence Research 22 (2004) 143-174 Submitted 01/04; published 11/04 Decentralized Control of Cooperative Systems: Categorization and Complexity Analysis Claudia V. Goldman Shlomo

More information

Bayes-Adaptive POMDPs 1

Bayes-Adaptive POMDPs 1 Bayes-Adaptive POMDPs 1 Stéphane Ross, Brahim Chaib-draa and Joelle Pineau SOCS-TR-007.6 School of Computer Science McGill University Montreal, Qc, Canada Department of Computer Science and Software Engineering

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Based on [Kaelbling et al., 1996, Bertsekas, 2000] Bert Kappen Reinforcement learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error

More information

Bayes-Adaptive POMDPs: Toward an Optimal Policy for Learning POMDPs with Parameter Uncertainty

Bayes-Adaptive POMDPs: Toward an Optimal Policy for Learning POMDPs with Parameter Uncertainty Bayes-Adaptive POMDPs: Toward an Optimal Policy for Learning POMDPs with Parameter Uncertainty Stéphane Ross School of Computer Science McGill University, Montreal (Qc), Canada, H3A 2A7 stephane.ross@mail.mcgill.ca

More information

Reinforcement Learning: the basics

Reinforcement Learning: the basics Reinforcement Learning: the basics Olivier Sigaud Université Pierre et Marie Curie, PARIS 6 http://people.isir.upmc.fr/sigaud August 6, 2012 1 / 46 Introduction Action selection/planning Learning by trial-and-error

More information

Reinforcement Learning in Partially Observable Multiagent Settings: Monte Carlo Exploring Policies

Reinforcement Learning in Partially Observable Multiagent Settings: Monte Carlo Exploring Policies Reinforcement earning in Partially Observable Multiagent Settings: Monte Carlo Exploring Policies Presenter: Roi Ceren THINC ab, University of Georgia roi@ceren.net Prashant Doshi THINC ab, University

More information

CMU Lecture 11: Markov Decision Processes II. Teacher: Gianni A. Di Caro

CMU Lecture 11: Markov Decision Processes II. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 11: Markov Decision Processes II Teacher: Gianni A. Di Caro RECAP: DEFINING MDPS Markov decision processes: o Set of states S o Start state s 0 o Set of actions A o Transitions P(s s,a)

More information

Lecture 25: Learning 4. Victor R. Lesser. CMPSCI 683 Fall 2010

Lecture 25: Learning 4. Victor R. Lesser. CMPSCI 683 Fall 2010 Lecture 25: Learning 4 Victor R. Lesser CMPSCI 683 Fall 2010 Final Exam Information Final EXAM on Th 12/16 at 4:00pm in Lederle Grad Res Ctr Rm A301 2 Hours but obviously you can leave early! Open Book

More information

COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning. Hanna Kurniawati

COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning. Hanna Kurniawati COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning Hanna Kurniawati Today } What is machine learning? } Where is it used? } Types of machine learning

More information

Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs

Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs Liam MacDermed College of Computing Georgia Institute of Technology Atlanta, GA 30332 liam@cc.gatech.edu Charles L. Isbell College

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Stuart Russell, UC Berkeley Stuart Russell, UC Berkeley 1 Outline Sequential decision making Dynamic programming algorithms Reinforcement learning algorithms temporal difference

More information

ARTIFICIAL INTELLIGENCE. Reinforcement learning

ARTIFICIAL INTELLIGENCE. Reinforcement learning INFOB2KI 2018-2019 Utrecht University The Netherlands ARTIFICIAL INTELLIGENCE Reinforcement learning Lecturer: Silja Renooij These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

More information

Reinforcement Learning. Yishay Mansour Tel-Aviv University

Reinforcement Learning. Yishay Mansour Tel-Aviv University Reinforcement Learning Yishay Mansour Tel-Aviv University 1 Reinforcement Learning: Course Information Classes: Wednesday Lecture 10-13 Yishay Mansour Recitations:14-15/15-16 Eliya Nachmani Adam Polyak

More information

CS 570: Machine Learning Seminar. Fall 2016

CS 570: Machine Learning Seminar. Fall 2016 CS 570: Machine Learning Seminar Fall 2016 Class Information Class web page: http://web.cecs.pdx.edu/~mm/mlseminar2016-2017/fall2016/ Class mailing list: cs570@cs.pdx.edu My office hours: T,Th, 2-3pm or

More information

SpringerBriefs in Intelligent Systems

SpringerBriefs in Intelligent Systems SpringerBriefs in Intelligent Systems Artificial Intelligence, Multiagent Systems, and Cognitive Robotics Series editors Gerhard Weiss, Maastricht University, Maastricht, The Netherlands Karl Tuyls, University

More information

Dialogue as a Decision Making Process

Dialogue as a Decision Making Process Dialogue as a Decision Making Process Nicholas Roy Challenges of Autonomy in the Real World Wide range of sensors Noisy sensors World dynamics Adaptability Incomplete information Robustness under uncertainty

More information

Final Exam December 12, 2017

Final Exam December 12, 2017 Introduction to Artificial Intelligence CSE 473, Autumn 2017 Dieter Fox Final Exam December 12, 2017 Directions This exam has 7 problems with 111 points shown in the table below, and you have 110 minutes

More information

Sample Bounded Distributed Reinforcement Learning for Decentralized POMDPs

Sample Bounded Distributed Reinforcement Learning for Decentralized POMDPs Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence Sample Bounded Distributed Reinforcement Learning for Decentralized POMDPs Bikramjit Banerjee 1, Jeremy Lyle 2, Landon Kraemer

More information

Temporal Difference Learning & Policy Iteration

Temporal Difference Learning & Policy Iteration Temporal Difference Learning & Policy Iteration Advanced Topics in Reinforcement Learning Seminar WS 15/16 ±0 ±0 +1 by Tobias Joppen 03.11.2015 Fachbereich Informatik Knowledge Engineering Group Prof.

More information

Exploiting Structure to Efficiently Solve Large Scale Partially Observable Markov Decision Processes. Pascal Poupart

Exploiting Structure to Efficiently Solve Large Scale Partially Observable Markov Decision Processes. Pascal Poupart Exploiting Structure to Efficiently Solve Large Scale Partially Observable Markov Decision Processes by Pascal Poupart A thesis submitted in conformity with the requirements for the degree of Doctor of

More information

Communication-Based Decomposition Mechanisms for Decentralized MDPs

Communication-Based Decomposition Mechanisms for Decentralized MDPs Journal of Artificial Intelligence Research 32 (2008) 169-202 Submitted 10/07; published 05/08 Communication-Based Decomposition Mechanisms for Decentralized MDPs Claudia V. Goldman Samsung Telecom Research

More information

Sequential Decision Problems

Sequential Decision Problems Sequential Decision Problems Michael A. Goodrich November 10, 2006 If I make changes to these notes after they are posted and if these changes are important (beyond cosmetic), the changes will highlighted

More information

An Introduction to Markov Decision Processes. MDP Tutorial - 1

An Introduction to Markov Decision Processes. MDP Tutorial - 1 An Introduction to Markov Decision Processes Bob Givan Purdue University Ron Parr Duke University MDP Tutorial - 1 Outline Markov Decision Processes defined (Bob) Objective functions Policies Finding Optimal

More information

A Concise Introduction to Decentralized POMDPs

A Concise Introduction to Decentralized POMDPs Frans A. Oliehoek & Christopher Amato A Concise Introduction to Decentralized POMDPs Author version July 6, 2015 Springer This is the authors pre-print version that does not include corrections that were

More information

Symbolic Perseus: a Generic POMDP Algorithm with Application to Dynamic Pricing with Demand Learning

Symbolic Perseus: a Generic POMDP Algorithm with Application to Dynamic Pricing with Demand Learning Symbolic Perseus: a Generic POMDP Algorithm with Application to Dynamic Pricing with Demand Learning Pascal Poupart (University of Waterloo) INFORMS 2009 1 Outline Dynamic Pricing as a POMDP Symbolic Perseus

More information

Outline. CSE 573: Artificial Intelligence Autumn Agent. Partial Observability. Markov Decision Process (MDP) 10/31/2012

Outline. CSE 573: Artificial Intelligence Autumn Agent. Partial Observability. Markov Decision Process (MDP) 10/31/2012 CSE 573: Artificial Intelligence Autumn 2012 Reasoning about Uncertainty & Hidden Markov Models Daniel Weld Many slides adapted from Dan Klein, Stuart Russell, Andrew Moore & Luke Zettlemoyer 1 Outline

More information

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Mostafa D. Awheda Department of Systems and Computer Engineering Carleton University Ottawa, Canada KS 5B6 Email: mawheda@sce.carleton.ca

More information

Planning Under Uncertainty II

Planning Under Uncertainty II Planning Under Uncertainty II Intelligent Robotics 2014/15 Bruno Lacerda Announcement No class next Monday - 17/11/2014 2 Previous Lecture Approach to cope with uncertainty on outcome of actions Markov

More information

Efficient Learning in Linearly Solvable MDP Models

Efficient Learning in Linearly Solvable MDP Models Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence Efficient Learning in Linearly Solvable MDP Models Ang Li Department of Computer Science, University of Minnesota

More information

CS788 Dialogue Management Systems Lecture #2: Markov Decision Processes

CS788 Dialogue Management Systems Lecture #2: Markov Decision Processes CS788 Dialogue Management Systems Lecture #2: Markov Decision Processes Kee-Eung Kim KAIST EECS Department Computer Science Division Markov Decision Processes (MDPs) A popular model for sequential decision

More information

Reinforcement Learning. Introduction

Reinforcement Learning. Introduction Reinforcement Learning Introduction Reinforcement Learning Agent interacts and learns from a stochastic environment Science of sequential decision making Many faces of reinforcement learning Optimal control

More information

QUICR-learning for Multi-Agent Coordination

QUICR-learning for Multi-Agent Coordination QUICR-learning for Multi-Agent Coordination Adrian K. Agogino UCSC, NASA Ames Research Center Mailstop 269-3 Moffett Field, CA 94035 adrian@email.arc.nasa.gov Kagan Tumer NASA Ames Research Center Mailstop

More information

Lecture 3: Markov Decision Processes

Lecture 3: Markov Decision Processes Lecture 3: Markov Decision Processes Joseph Modayil 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs Markov Processes Introduction Introduction to MDPs Markov

More information

Final Exam December 12, 2017

Final Exam December 12, 2017 Introduction to Artificial Intelligence CSE 473, Autumn 2017 Dieter Fox Final Exam December 12, 2017 Directions This exam has 7 problems with 111 points shown in the table below, and you have 110 minutes

More information

Partially Observable Markov Decision Processes (POMDPs)

Partially Observable Markov Decision Processes (POMDPs) Partially Observable Markov Decision Processes (POMDPs) Sachin Patil Guest Lecture: CS287 Advanced Robotics Slides adapted from Pieter Abbeel, Alex Lee Outline Introduction to POMDPs Locally Optimal Solutions

More information

Temporal Difference. Learning KENNETH TRAN. Principal Research Engineer, MSR AI

Temporal Difference. Learning KENNETH TRAN. Principal Research Engineer, MSR AI Temporal Difference Learning KENNETH TRAN Principal Research Engineer, MSR AI Temporal Difference Learning Policy Evaluation Intro to model-free learning Monte Carlo Learning Temporal Difference Learning

More information

Markov Decision Processes Chapter 17. Mausam

Markov Decision Processes Chapter 17. Mausam Markov Decision Processes Chapter 17 Mausam Planning Agent Static vs. Dynamic Fully vs. Partially Observable Environment What action next? Deterministic vs. Stochastic Perfect vs. Noisy Instantaneous vs.

More information

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes Today s s Lecture Lecture 20: Learning -4 Review of Neural Networks Markov-Decision Processes Victor Lesser CMPSCI 683 Fall 2004 Reinforcement learning 2 Back-propagation Applicability of Neural Networks

More information

Learning in Zero-Sum Team Markov Games using Factored Value Functions

Learning in Zero-Sum Team Markov Games using Factored Value Functions Learning in Zero-Sum Team Markov Games using Factored Value Functions Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 27708 mgl@cs.duke.edu Ronald Parr Department of Computer

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Markov decision process & Dynamic programming Evaluative feedback, value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value

More information

Reinforcement Learning and Deep Reinforcement Learning

Reinforcement Learning and Deep Reinforcement Learning Reinforcement Learning and Deep Reinforcement Learning Ashis Kumer Biswas, Ph.D. ashis.biswas@ucdenver.edu Deep Learning November 5, 2018 1 / 64 Outlines 1 Principles of Reinforcement Learning 2 The Q

More information

RL 14: Simplifications of POMDPs

RL 14: Simplifications of POMDPs RL 14: Simplifications of POMDPs Michael Herrmann University of Edinburgh, School of Informatics 04/03/2016 POMDPs: Points to remember Belief states are probability distributions over states Even if computationally

More information

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation CS234: RL Emma Brunskill Winter 2018 Material builds on structure from David SIlver s Lecture 4: Model-Free

More information

POMDPs and Policy Gradients

POMDPs and Policy Gradients POMDPs and Policy Gradients MLSS 2006, Canberra Douglas Aberdeen Canberra Node, RSISE Building Australian National University 15th February 2006 Outline 1 Introduction What is Reinforcement Learning? Types

More information

2534 Lecture 4: Sequential Decisions and Markov Decision Processes

2534 Lecture 4: Sequential Decisions and Markov Decision Processes 2534 Lecture 4: Sequential Decisions and Markov Decision Processes Briefly: preference elicitation (last week s readings) Utility Elicitation as a Classification Problem. Chajewska, U., L. Getoor, J. Norman,Y.

More information

Planning in Markov Decision Processes

Planning in Markov Decision Processes Carnegie Mellon School of Computer Science Deep Reinforcement Learning and Control Planning in Markov Decision Processes Lecture 3, CMU 10703 Katerina Fragkiadaki Markov Decision Process (MDP) A Markov

More information

Decision Theory: Markov Decision Processes

Decision Theory: Markov Decision Processes Decision Theory: Markov Decision Processes CPSC 322 Lecture 33 March 31, 2006 Textbook 12.5 Decision Theory: Markov Decision Processes CPSC 322 Lecture 33, Slide 1 Lecture Overview Recap Rewards and Policies

More information