Decision-theoretic approaches to planning, coordination and communication in multiagent systems


1 Decision-theoretic approaches to planning, coordination and communication in multiagent systems Matthijs Spaan (1), Frans Oliehoek (2), Stefan Witwicki (3). (1) Delft University of Technology, (2) U. of Liverpool & U. of Amsterdam, (3) EPFL. ESSENCE Autumn School, Ischia, Italy, October 29, 2014

2 Introduction

3 Introduction Goal in Artificial Intelligence: to build intelligent agents. Our definition of intelligent: perform an assigned task as well as possible. Problem: how to act? We will explicitly model uncertainty.

4 Motivation Intelligent distributed systems are becoming ubiquitous: Smart energy grid infrastructure Surveillance camera networks Autonomous guided vehicles, vehicular networks Internet or smart phone applications Devices can sense, compute, act and interact.

5 Motivation Intelligent distributed systems are becoming ubiquitous: Smart energy grid infrastructure Surveillance camera networks Autonomous guided vehicles, vehicular networks Internet or smart phone applications Devices can sense, compute, act and interact. Growing need for scalable and flexible multiagent planning. Key issue: uncertainty

6 Outline Before the break: 1. Introduction to decision making under uncertainty 2. Planning under action uncertainty (MDPs) 3. Planning under sensing uncertainty (POMDPs) After the break: 1. Multiagent planning (Dec-POMDPs) 2. Communication

7 Agents An agent is a (rational) decision maker who is able to perceive its external (physical) environment and act autonomously upon it (Russell and Norvig, 2003). Rationality means reaching the optimum of a performance measure. Examples: humans, robots, some software programs.

8 Agents (figure: the agent-environment loop, with the agent receiving a state observation and selecting an action) It is useful to think of agents as being involved in a perception-action loop with their environment. But how do we make the right decisions?

9 Planning Planning: a plan tells an agent how to act, for instance a sequence of actions to reach a goal, or what to do in a particular situation. We need to model: the agent's actions, its environment, and its task. We will model planning as a sequence of decisions.

10-11 Classic planning Classic planning: a sequence of actions from start to goal. Task: the robot should get to the gold as quickly as possible. Actions: movements (shown as arrows in the figure). Limitations: a new plan is needed for each start state, and the environment is assumed deterministic. Three optimal plans are shown in the figure.

12 Conditional planning Assume our robot has noisy actions (wheel slip, overshoot). We need conditional plans. Map situations to actions.

13 Decision-theoretic planning Positive reward when reaching the goal, small penalty for all other actions. The agent's plan maximizes value: the sum of future rewards. Decision-theoretic planning successfully handles noise in acting and sensing.

14-20 Decision-theoretic planning (figures: two example plans with their rewards, the values of each plan, and finally the optimal values, which encode the optimal plan)

21 Markov Decision Processes

22 Sequential decision making under uncertainty Uncertainty is abundant in real-world planning domains. Bayesian approach: probabilistic models. Main assumptions: Sequential decisions: problems are formulated as a sequence of independent decisions; Markovian environment: the state at time t depends only on the events at time t − 1; Evaluative feedback: use of a reinforcement signal as performance measure (reinforcement learning).

23 Transition model For instance, robot motion is inaccurate. Transitions between states are stochastic. p(s′|s, a) is the probability of jumping from state s to state s′ after taking action a.

24-26 MDP (figure: the agent-environment loop; the agent observes state s and reward r and selects action a according to its policy π, while the environment evolves according to the transition model p(s′|s, a) and the reward function R(s, a))

27 Optimality criterion For instance, the agent should maximize the value E[ Σ_{t=0}^{h} γ^t R_t ], (1) where h is the planning horizon (finite or ∞) and γ is a discount rate, 0 ≤ γ < 1. Reward hypothesis (Sutton and Barto, 1998): All goals and purposes can be formulated as the maximization of the cumulative sum of a received scalar signal (reward).

28 Discrete MDP model Discrete Markov Decision Process model (Puterman, 1994; Bertsekas, 2000): Time t is discrete. State space S. Set of actions A. Reward function R : S × A → R. Transition model p(s′|s, a), with T_a : S × A → Δ(S). Initial state s_0 is drawn from Δ(S). The Markov property entails that the next state s_{t+1} only depends on the previous state s_t and action a_t: p(s_{t+1} | s_t, s_{t−1}, ..., s_0, a_t, a_{t−1}, ..., a_0) = p(s_{t+1} | s_t, a_t). (2)

29 A simple problem Problem: an autonomous robot must learn how to transport material from a deposit to a building facility. (Figure: positions 1, 2, 3, with the material at position 1 and the building facility at position 3; the robot can Load/Unload and Move Left/Move Right.) (thanks to F. Melo)

30 Load/Unload as an MDP States: S = {1_U, 2_U, 3_U, 1_L, 2_L, 3_L}: 1_U robot in position 1 (unloaded); 2_U robot in position 2 (unloaded); 3_U robot in position 3 (unloaded); 1_L robot in position 1 (loaded); 2_L robot in position 2 (loaded); 3_L robot in position 3 (loaded). Actions: A = {Left, Right, Load, Unload}.

31 Load/Unload as an MDP (1) Transition probabilities: Left / Right move the robot in the corresponding direction; Load loads material (only in position 1); Unload unloads material (only in position 3). Ex: (2_L, Right) → 3_L; (3_L, Unload) → 3_U; (1_L, Unload) → 1_L. Reward: we assign a reward of +1 for every unloaded package (payment for the service).

32 Load/Unload as an MDP (2) For each action a ∈ A, T_a is a |S| × |S| matrix. Ex: T_Right is shown as a 6 × 6 matrix on the slide. Recall: S = {1_U, 2_U, 3_U, 1_L, 2_L, 3_L}.

33 Load/Unload as an MDP (3) The reward R(s, a) can also be represented as a matrix, with a single +1 entry for taking Unload in state 3_L. S = {1_U, 2_U, 3_U, 1_L, 2_L, 3_L}, A = {Left, Right, Load, Unload}
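
To make the model concrete, the following is a minimal Python/NumPy sketch of the Load/Unload MDP. It assumes deterministic movement and the +1 unloading reward described above; the exact matrix entries shown on the slides are not reproduced, so the numbers here are illustrative assumptions.

```python
import numpy as np

# States and actions of the Load/Unload example (same ordering as the slides).
states = ["1U", "2U", "3U", "1L", "2L", "3L"]
actions = ["Left", "Right", "Load", "Unload"]
S, A = len(states), len(actions)
idx = states.index

# T[a, s, s'] = p(s' | s, a); movement is assumed deterministic here.
T = np.zeros((A, S, S))
for s, name in enumerate(states):
    pos, loaded = int(name[0]), name[1] == "L"
    flag = "L" if loaded else "U"
    T[0, s, idx(f"{max(pos - 1, 1)}{flag}")] = 1.0   # Left (stays put at position 1)
    T[1, s, idx(f"{min(pos + 1, 3)}{flag}")] = 1.0   # Right (stays put at position 3)
    T[2, s, idx("1L" if pos == 1 else name)] = 1.0   # Load only works in position 1
    T[3, s, idx("3U" if pos == 3 else name)] = 1.0   # Unload only works in position 3

# R[s, a]: +1 for unloading the package at the building facility.
R = np.zeros((S, A))
R[idx("3L"), actions.index("Unload")] = 1.0
```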

34 Policies and value Policy π: tells the agent how to act. A deterministic policy π : S → A is a mapping from states to actions. Value: how much reward E[ Σ_{t=0}^h γ^t R_t ] the agent expects to gather. The value is denoted Q^π(s, a): start in s, do a and follow π afterwards.

35 Policies and value (1) Extracting a policy π from a value function Q is easy: π(s) = arg max_{a ∈ A} Q(s, a). (3) Optimal policy π*: one that maximizes E[ Σ_{t=0}^h γ^t R_t ] (for every state). In an infinite-horizon MDP there is always an optimal deterministic stationary (time-independent) policy π*. There can be many optimal policies π*, but they all share the same optimal value function Q*.

36-47 Dynamic programming Since S and A are finite, Q_n(s, a) is a matrix. The slides step through iterations of dynamic programming (γ = 0.95), filling in the entries of Q_1 and then Q_2. S = {1_U, 2_U, 3_U, 1_L, 2_L, 3_L}, A = {Left, Right, Load, Unload}

48-49 Dynamic programming Further iterations of dynamic programming (γ = 0.95): Q_5 and a later iterate are shown on the slides. S = {1_U, 2_U, 3_U, 1_L, 2_L, 3_L}, A = {Left, Right, Load, Unload}

50 Dynamic programming Final Q* and policy: π* = (Load, Left, Left, Right, Right, Unload) for the states (1_U, 2_U, 3_U, 1_L, 2_L, 3_L); the converged Q* matrix is shown on the slide.

51 Value iteration Value iteration: successive approximation technique. Start with all values set to 0. In order to consider one step deeper into the future, i.e., to compute Q_{n+1} from Q_n: Q_{n+1}(s, a) := R(s, a) + γ Σ_{s′ ∈ S} p(s′|s, a) max_{a′ ∈ A} Q_n(s′, a′), (4) which is known as the dynamic programming update or Bellman backup. Bellman (1957) equation: Q*(s, a) = R(s, a) + γ Σ_{s′ ∈ S} p(s′|s, a) max_{a′ ∈ A} Q*(s′, a′). (5)
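
As an illustration, here is a small sketch of Q-value iteration implementing the Bellman backup of Eq. (4). It assumes the model is given as the arrays T[a, s, s'] and R[s, a] from the Load/Unload sketch above.

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, eps=1e-6):
    """Q-value iteration: repeatedly apply the Bellman backup of Eq. (4)."""
    A, S, _ = T.shape
    Q = np.zeros((S, A))                   # start with all values set to 0
    while True:
        V = Q.max(axis=1)                  # V_n(s) = max_a' Q_n(s, a')
        Q_new = R + gamma * (T @ V).T      # R(s,a) + gamma sum_s' p(s'|s,a) V_n(s')
        if np.abs(Q_new - Q).max() < eps:  # largest update below threshold: converged
            return Q_new
        Q = Q_new

# Greedy policy extraction as in Eq. (3):
# Q = value_iteration(T, R)
# policy = [actions[a] for a in Q.argmax(axis=1)]
```

On the deterministic Load/Unload sketch this should recover a policy of the form shown on slide 50 (Load, Left, Left, Right, Right, Unload).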

52 Value iteration Value iteration discussion: As n → ∞, value iteration converges. Value iteration has converged when the largest update δ in an iteration is below a certain threshold ε. Exhaustive sweeps are not required for convergence, provided that in the limit all states are visited infinitely often. This can be exploited by backing up the most promising states first, known as prioritized sweeping.

53 Q-learning Reinforcement-learning techniques learn from experience; no knowledge of the model is required. Q-learning update: Q(s, a) := (1 − β) Q(s, a) + β [ r + γ max_{a′ ∈ A} Q(s′, a′) ], (6) where 0 < β ≤ 1 is a learning rate.
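
A tabular Q-learning sketch with ε-greedy exploration (discussed on the next slide); env_step is a hypothetical function, assumed to sample a reward and next state from the (unknown) environment.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_learning(Q, env_step, s, beta=0.1, gamma=0.95, epsilon=0.1, steps=1000):
    """Tabular Q-learning, Eq. (6), with epsilon-greedy action selection."""
    S, A = Q.shape
    for _ in range(steps):
        # With small probability epsilon pick a random action, otherwise act greedily.
        a = int(rng.integers(A)) if rng.random() < epsilon else int(Q[s].argmax())
        r, s_next = env_step(s, a)         # sample experience from the environment
        # Q(s,a) := (1 - beta) Q(s,a) + beta [ r + gamma max_a' Q(s',a') ]
        Q[s, a] = (1 - beta) * Q[s, a] + beta * (r + gamma * Q[s_next].max())
        s = s_next
    return Q
```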

54 Q-learning Q-learning discussion: Q-learning is guaranteed to converge to the optimal Q-values if all Q(s, a) values are updated infinitely often. In order to make sure all actions will eventually be tried in all states, exploration is necessary. A common exploration method is to execute a random action with small probability ε, which is known as ε-greedy exploration.

55 Solution methods: MDPs Model based Basic: dynamic programming (Bellman, 1957), value iteration, policy iteration. Advanced: prioritized sweeping, function approximators. Model free, reinforcement learning (Sutton and Barto, 1998) Basic: Q-learning, TD(λ), SARSA, actor-critic. Advanced: generalization in infinite state spaces, exploration/exploitation issues.

56 POMDPs

57-62 Beyond MDPs MDPs have been very successful, but they require an observable Markovian state. In many domains this is impossible (or expensive) to obtain: Diagnosis (medical, maintenance) Robot navigation Tutoring Dialog systems Vision-based robotics Fault recovery

63 Observation model Imperfect sensors. Partially observable environment: sensors are noisy, and sensors have a limited view. p(o|s, a) is the probability that the agent receives observation o in state s after taking action a.

64-67 POMDP (figure: the agent-environment loop; the agent receives observation o and reward r and selects action a according to its policy π, while the hidden state s evolves according to the transition model p(s′|s, a), the observation model p(o|s, a) and the reward function R(s, a))

68-75 POMDPs Partially observable Markov decision processes (POMDPs) (Kaelbling et al., 1998): Framework for agent planning under uncertainty. Typically assumes discrete sets of states S, actions A and observations O. Transition model p(s′|s, a): models the effect of actions. Observation model p(o|s, a): relates observations to states. Task is defined by a reward model R(s, a). A planning horizon h (finite or ∞). A discount rate 0 ≤ γ < 1. Goal is to compute a plan, or policy π, that maximizes long-term reward.

76-81 Memory In POMDPs memory is required for optimal decision making. In this non-observable example (Singh et al., 1994), with two states A and B, actions goA and goB, and rewards −r and +r, the slides compare the value obtained by: the MDP optimal policy, V = Σ_t γ^t r = r/(1 − γ); a memoryless deterministic POMDP policy; a memoryless stochastic POMDP policy (V = 0); and a memory-based (optimal) POMDP policy. The corresponding V_max and V_min expressions are derived on the slides.

82 Beliefs Beliefs: the agent maintains a belief b(s) of being at state s. After action a ∈ A and observation o ∈ O the belief b(s) can be updated using Bayes' rule: b′(s′) ∝ p(o|s′, a) Σ_s p(s′|s, a) b(s). The belief vector is a Markov signal for the planning task.
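
A minimal sketch of this belief update, assuming the models are stored as arrays T[a, s, s'] = p(s'|s, a) and O[a, s', o] = p(o|s', a) (the array names are assumptions, not from the slides); it returns both the new belief and the normalizer p(o|b, a).

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Bayes' rule: b'(s') is proportional to p(o|s',a) sum_s p(s'|s,a) b(s)."""
    predicted = b @ T[a]              # sum_s p(s'|s,a) b(s)
    unnormalized = O[a, :, o] * predicted
    p_o = unnormalized.sum()          # p(o | b, a), the normalizer
    return unnormalized / p_o, p_o
```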

83-88 Belief update example (figures: the robot's true position in a corridor with doors, and its belief over positions, updated after each observation and movement). Observations: door or corridor, observed with some noise. Action: moves 3 (20%), 4 (60%), or 5 (20%) states.

89 MDP-based algorithms Exploit the belief state, and use the MDP solution as a heuristic. Most likely state (Cassandra et al., 1996): π_MLS(b) = π*(arg max_s b(s)). Q_MDP (Littman et al., 1995): π_QMDP(b) = arg max_a Σ_s b(s) Q*(s, a). (Figure: a small example domain from Parr and Russell, 1995.)
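
Both heuristics are one-liners once an MDP Q-function is available, e.g. from the value_iteration sketch earlier; a hedged sketch:

```python
import numpy as np

def pi_mls(b, Q_mdp):
    """Most likely state: act as the MDP policy would in argmax_s b(s)."""
    return int(Q_mdp[int(b.argmax())].argmax())

def pi_qmdp(b, Q_mdp):
    """Q_MDP: argmax_a sum_s b(s) Q*(s, a)."""
    return int((b @ Q_mdp).argmax())
```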

90-93 POMDPs as continuous-state MDPs A belief-state POMDP can be treated as a continuous-state MDP: Continuous state space Δ(S): the simplex in [0, 1]^{|S|}. Stochastic Markovian transition model p(b^o_a | b, a) = p(o | b, a); this is the normalizer of Bayes' rule. Reward function R(b, a) = Σ_s R(s, a) b(s); this is the average reward with respect to b(s). The robot fully observes the new belief state b^o_a after executing a and observing o.

94-97 Solving POMDPs A solution to a POMDP is a policy, i.e., a mapping π : Δ(S) → A from beliefs to actions. The optimal value V* of a POMDP satisfies the Bellman optimality equation V* = HV*: V*(b) = max_a [ R(b, a) + γ Σ_o p(o|b, a) V*(b^o_a) ]. Value iteration repeatedly applies V_{n+1} = HV_n starting from an initial V_0. Computing the optimal value function is a hard problem (PSPACE-complete for finite horizon, undecidable for infinite horizon).

98 Example (figure: a value function over the belief space built from vectors α_1, α_2, α_3, together with the reward function R(s, a) for actions a_1, a_2, a_3 in two states)

99 PWLC shape of V_n Like V*, V_n is piecewise linear and convex as well. Rewards R(b, a) = Σ_s b(s) R(s, a) are linear functions of b. Note that the value of a point b satisfies: V_{n+1}(b) = max_a [ Σ_s b(s) R(s, a) + γ Σ_o p(o|b, a) V_n(b^o_a) ], which involves a maximization over (at least) the vectors R(·, a). Intuitively: less uncertainty about the state (low-entropy beliefs) means better decisions (thus higher value).

100 Exact value iteration Value iteration computes a sequence of value function estimates V_1, V_2, ..., V_n, using the POMDP backup operator H, V_{n+1} = HV_n. (Figure: successive value functions V_1, V_2, V_3 and V* over the belief space.)

101 Optimal value functions The optimal value function of a (finite-horizon) POMDP is piecewise linear and convex: V*(b) = max_α b · α. (Figure: a value function composed of vectors α_1, ..., α_4 over the belief space.)

102 Vector pruning (Figure: a value function with vectors α_1, ..., α_5, some of which are dominated, and witness beliefs b_1, b_2.) Linear program for pruning: variables: b(s) for all s ∈ S, and x; maximize: x; subject to: b · (α − α′) ≥ x for all α′ ∈ V, α′ ≠ α, and b ∈ Δ(S).
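
As an illustration, this pruning LP can be written down with scipy.optimize.linprog; the following is a sketch (the variable layout and function name are my own), where `alpha` is kept if the optimal x is positive and can be pruned otherwise.

```python
import numpy as np
from scipy.optimize import linprog

def witness_lp(alpha, others):
    """Maximize x s.t. b.(alpha - alpha') >= x for all alpha' != alpha, b in the simplex."""
    alpha, others = np.asarray(alpha, float), np.asarray(others, float)
    S = len(alpha)
    c = np.zeros(S + 1)
    c[-1] = -1.0                                    # maximize x == minimize -x
    A_ub = np.hstack([-(alpha - others), np.ones((len(others), 1))])
    b_ub = np.zeros(len(others))                    # encodes x - b.(alpha - alpha') <= 0
    A_eq = np.hstack([np.ones((1, S)), np.zeros((1, 1))])
    b_eq = np.array([1.0])                          # b sums to one
    bounds = [(0.0, 1.0)] * S + [(None, None)]      # b(s) in [0,1], x free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:S], res.x[-1]                     # witness belief b and value x
```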

103 Optimal POMDP methods Enumerate and prune: Most straightforward: Monahan's (1982) enumeration algorithm. Generates a maximum of |A| |V_n|^{|O|} vectors at each iteration, hence requires pruning. Incremental pruning (Zhang and Liu, 1996; Cassandra et al., 1997). Search for witness points: One Pass (Sondik, 1971; Smallwood and Sondik, 1973). Relaxed Region, Linear Support (Cheng, 1988). Witness (Cassandra et al., 1994).

104 Sub-optimal techniques Grid-based approximations (Drake, 1962; Lovejoy, 1991; Brafman, 1997; Zhou and Hansen, 2001; Bonet, 2002). Optimizing finite-state controllers (Platzman, 1981; Hansen, 1998b; Poupart and Boutilier, 2004). Heuristic search in the belief tree (Satia and Lave, 1973; Hansen, 1998a). Compression or clustering (Roy et al., 2005; Poupart and Boutilier, 2003; Virin et al., 2007). Point-based techniques (Pineau et al., 2003; Smith and Simmons, 2004; Spaan and Vlassis, 2005; Shani et al., 2007; Kurniawati et al., 2008). Monte Carlo tree search (Silver and Veness, 2010).

105-106 Point-based backup For a finite horizon V* is piecewise linear and convex, and for infinite horizons V* can be approximated arbitrarily well by a PWLC value function (Smallwood and Sondik, 1973). Given a value function V_n and a particular belief point b we can easily compute the vector α^b_{n+1} of HV_n such that α^b_{n+1} = arg max_{{α^k_{n+1}}_k} b · α^k_{n+1}, where {α^k_{n+1}}_{k=1}^{|HV_n|} is the (unknown) set of vectors for HV_n. We will denote this operation α^b_{n+1} = backup(b).
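
The backup(b) operation can be implemented directly from the model, without enumerating HV_n; a sketch, reusing the T[a, s, s'], O[a, s', o] and R[s, a] array conventions assumed earlier:

```python
import numpy as np

def backup(b, V, T, O, R, gamma=0.95):
    """Point-based backup: the vector of HV_n that is maximizing at belief b.

    V is an array of alpha-vectors with shape (K, S); returns (alpha_b, action).
    """
    A, S, _ = T.shape
    best_val, best_vec, best_act = -np.inf, None, None
    for a in range(A):
        g_a = R[:, a].astype(float)
        for o in range(O.shape[2]):
            # g_k(s) = sum_s' p(o|s',a) p(s'|s,a) alpha_k(s') for every alpha_k in V
            g_ao = (T[a] * O[a, :, o][None, :]) @ V.T    # shape (S, K)
            k = int((b @ g_ao).argmax())                 # best vector for this (a, o) at b
            g_a = g_a + gamma * g_ao[:, k]
        if b @ g_a > best_val:
            best_val, best_vec, best_act = float(b @ g_a), g_a, a
    return best_vec, best_act
```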

107 Point-based (approximate) methods Point-based (approximate) value iteration plans only on a limited set of reachable belief points: 1. Let the robot explore the environment. 2. Collect a set B of belief points. 3. Run approximate value iteration on B.

108-119 PERSEUS: randomized point-based VI Idea: at every backup stage improve the value of all b ∈ B. (Figures: successive value functions V_1, V_2, V_3, V_4 over a set of belief points b_1, ..., b_7; in each stage only a subset of the points is backed up, yet the value of all points improves.) (Spaan and Vlassis, 2005)
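
A sketch of one PERSEUS backup stage, under the same array conventions and using the backup(b) sketch above; only a randomly selected subset of B is backed up, yet the value of every b in B does not decrease.

```python
import numpy as np

def perseus_stage(B, V, T, O, R, gamma=0.95, rng=np.random.default_rng(0)):
    """One backup stage of PERSEUS-style randomized point-based value iteration."""
    old_value = [max(float(b @ alpha) for alpha in V) for b in B]
    todo, V_new = set(range(len(B))), []
    while todo:
        i = int(rng.choice(list(todo)))                   # pick a not-yet-improved point
        alpha_b, _ = backup(B[i], V, T, O, R, gamma)
        if float(B[i] @ alpha_b) < old_value[i]:          # keep the old best vector instead
            alpha_b = max(V, key=lambda alpha: float(B[i] @ alpha))
        V_new.append(alpha_b)
        # points whose value is already improved by V_new need no further backup
        todo = {j for j in todo
                if max(float(B[j] @ alpha) for alpha in V_new) < old_value[j]}
    return np.array(V_new)
```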

120 Sub-optimal techniques Grid-based approximations (Drake, 1962; Lovejoy, 1991; Brafman, 1997; Zhou and Hansen, 2001; Bonet, 2002). Optimizing finite-state controllers (Platzman, 1981; Hansen, 1998b; Poupart and Boutilier, 2004). Heuristic search in the belief tree (Satia and Lave, 1973; Hansen, 1998a). Compression or clustering (Roy et al., 2005; Poupart and Boutilier, 2003; Virin et al., 2007). Point-based techniques (Pineau et al., 2003; Smith and Simmons, 2004; Spaan and Vlassis, 2005; Shani et al., 2007; Kurniawati et al., 2008). Monte Carlo tree search (Silver and Veness, 2010).

121 Multiagent Planning

122 Multiagent Systems (MASs) Why MASs? If we can make intelligent agents, soon there will be many... Physically distributed systems: centralized solutions are expensive and brittle. MASs can potentially provide [Vlassis, 2007; Sycara, 1998]: Speedup and efficiency. Robustness and reliability ('graceful degradation'). Scalability and flexibility (adding additional agents).

123-126 Example: Predator-Prey Domain Predator-Prey domain, still single agent! Agent: the predator (blue); the prey (red) is part of the environment; on a torus ('wrap around world'). This is a Markov decision process (MDP): the Markovian environment state s is observed, a policy π maps states to actions, there is a value function Q(s, a), and value iteration is a way to compute it. Formalization: states, e.g., (-3,4); actions N, W, S, E; transitions: probability of failing to move, prey moves; rewards: reward for capturing.

127-131 Partial Observability Now: partial observability, e.g., a limited range of sight. This gives a Partially Observable MDP (POMDP): MDP + observations, with explicit observations, observation probabilities and noisy observations (detection probability), e.g., o = 'nothing' when the prey is out of sight, or its observed relative position otherwise. The agent cannot observe the state, so it needs to maintain a belief over states b(s), and a policy maps beliefs to actions, π(b) = a. Reduction: a continuous-state MDP (in which the belief is the state); value iteration makes use of α-vectors (which correspond to complete policies) and performs pruning to eliminate dominated α's.

132-136 Multiple Agents Now: multiple agents, fully observable. This is a Multiagent MDP [Boutilier 1996]. Differences with an MDP: n agents; joint actions a = ⟨a_1, a_2, ..., a_n⟩; transitions and rewards depend on joint actions. Formalization: states, e.g., ((3,-4), (0,0), (-2,0)); actions {N, W, S, E}; joint actions {(N,N,N), (N,N,W), ..., (E,E,E)}; transitions: probability of failing to move, prey moves; rewards: reward for capturing jointly. Solution: treat it as a normal MDP with a 'puppeteer agent'; the optimal policy π(s) = a, and every agent executes its part. Catch: the number of joint actions is exponential! (But other than that, conceptually simple.)

137-140 Multiple Agents & Partial Observability Now: both partial observability and multiple agents. This gives Decentralized POMDPs (Dec-POMDPs) [Bernstein et al. 2002], with both joint actions and joint observations. Again we can make a reduction: Dec-POMDP → MPOMDP (multiagent POMDP), with a 'puppeteer' agent that receives joint observations and takes joint actions. This requires broadcasting observations! With instantaneous, cost-free, noise-free communication it is optimal [Pynadath and Tambe 2002]. Without such communication: no easy reduction.

141 The Dec-POMDP Model

142 Acting Based On Local Observations MPOMDP: act on global information. This can be impractical: communication may not be possible; it may have significant cost (e.g., battery power); it may not be instantaneous or noise free; and it scales poorly with the number of agents! Alternative: act based only on local observations. Other side of the spectrum: no communication at all. (There are also intermediate approaches: delayed communication, stochastic delays.)

143 Formal Model A Dec-POMDP is a tuple ⟨S, A, P_T, O, P_O, R, h⟩ with n agents: S set of states; A set of joint actions, a = ⟨a_1, a_2, ..., a_n⟩; P_T transition function P(s′ | s, a); O set of joint observations, o = ⟨o_1, o_2, ..., o_n⟩; P_O observation function P(o | a, s′); R reward function R(s, a); h horizon (finite).
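
As a data structure, the Dec-POMDP tuple can be sketched as follows; this is a hedged illustration, with field names and array layouts of my own choosing rather than a standard library API.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple
import numpy as np

@dataclass
class DecPOMDP:
    """Container for the tuple <S, A, P_T, O, P_O, R, h> with n agents."""
    n_agents: int
    states: Sequence[str]
    joint_actions: Sequence[Tuple[int, ...]]       # all tuples <a_1, ..., a_n>
    joint_observations: Sequence[Tuple[int, ...]]  # all tuples <o_1, ..., o_n>
    P_T: np.ndarray                                # P_T[a, s, s'] = P(s' | s, a)
    P_O: np.ndarray                                # P_O[a, s', o] = P(o | a, s')
    R: np.ndarray                                  # R[s, a]
    h: int                                         # finite horizon
```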

144-147 Running Example The 2 generals problem: a small army against a large army. S = {s_L, s_S}; A_i = {(O)bserve, (A)ttack}; O_i = {(L)arge, (S)mall}. Transitions: if both generals Observe there is no state change; if at least one Attacks, the state resets with 5% probability. Observations: the probability of a correct observation is 0.85, e.g., P(⟨L, L⟩ | s_L) = 0.85 * 0.85 = 0.7225. Rewards: if one general attacks alone, he loses the battle (a negative reward R(·, ⟨A,O⟩)); if both generals Observe there is a small cost R(·, ⟨O,O⟩); if both Attack, the reward depends on the state: R(s_L, ⟨A,A⟩) = −2, R(s_S, ⟨A,A⟩) = +5. Now suppose h = 3: what do you think is optimal in this problem?

148 Off-line / On-line phases Off-line planning, on-line execution; the execution is decentralized: Planning Phase → Execution Phase, π = ⟨π_1, π_2⟩.

149-150 Policy Domain What do policies look like? In general, they map histories to actions. Before, we had more compact representations (states, beliefs)... Now this is difficult: no such representation is known! So we will be stuck with histories. Most general are action-observation histories (AOHs): θ_i^t = (a_i^0, o_i^1, a_i^1, ..., a_i^{t−1}, o_i^t). But we can restrict ourselves to deterministic policies, which only need observation histories (OHs): ō_i^t = (o_i^1, ..., o_i^t).

151-153 No Compact Representation? There are a number of types of beliefs considered. Joint belief b(s) (as in an MPOMDP) [Pynadath and Tambe 2002]: compute b(s) using joint actions and observations. Problem: agents do not know those during execution. Multiagent belief b_i(s, q_{-i}) [Hansen et al. 2004]: a belief over states and the (future) policies of the other agents. We need to be able to predict the other agents: for the belief update P(s′ | s, a_i, a_{-i}) and P(o | a_i, a_{-i}, s′), and for the prediction of R(s, a_i, a_{-i}). What is the form of those other policies? Most general: π_j : ō_j → a_j. What if they use beliefs? An infinite recursion of beliefs!

154-156 Goal of Planning Find the optimal joint policy π* = ⟨π_1, π_2⟩, where individual policies map OHs to actions, π_i : Ō_i → A_i. What is the optimal one? Define the value as the expected sum of rewards: V(π) = E[ Σ_{t=0}^{h−1} R(s, a) | π, b_0 ]. An optimal joint policy is one with maximal value (there can be more than one that achieves this). Optimal policy for 2 generals, h = 3 (identical for both agents): () → observe; (o_small) → observe; (o_large) → observe; (o_small, o_small) → attack; (o_small, o_large) → attack; (o_large, o_small) → attack; (o_large, o_large) → observe. Conceptually: what should a policy optimize to allow for good coordination (and thus high value)?

157 Coordination vs. Exploitation of Local Information There is an inherent trade-off between coordination and the exploitation of local information. Ignore your own observations, an 'open loop plan' (e.g., ATTACK on the 2nd time step): + maximally predictable, − low quality. Ignore coordination, e.g., compute an individual belief b_i(s) = Σ_{q_{-i}} b(s, q_{-i}) and execute the MPOMDP policy: + uses local information, − likely to result in mis-coordination. The optimal policy π* should balance between these.

158 Planning Methods

159-161 Brute Force Search We can compute the value of a joint policy V(π) using a Bellman-like equation [Oliehoek 2012]. So the stupidest algorithm is: compute V(π) for all π, and select a π with maximum value. The number of joint policies is huge, doubly exponential in the horizon h (e.g., about 3.352e+153 joint policies for h = 8). Clearly intractable... And there is no easy way out: the problem is NEXP-complete [Bernstein et al. 2002], so most likely (assuming EXP != NEXP) doubly exponential time is required. Still, there are better algorithms that work better for at least some problems, and trying to compute the optimum is useful to understand what optimal really means.
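
A sketch of the evaluation step of this brute-force scheme: computing V(π) for one deterministic joint policy by taking the expectation over states and joint observations (the enumeration over all joint policies, which is the doubly exponential part, is left out). It assumes the DecPOMDP container sketched earlier and that joint_policy[i] maps an individual observation history (a tuple) to an individual action index.

```python
def policy_value(model, joint_policy, b0):
    """Expected sum of rewards of a deterministic joint policy over horizon h."""
    n, n_states = model.n_agents, len(model.states)

    def V(t, s, histories):
        if t == model.h:
            return 0.0
        acts = tuple(joint_policy[i][histories[i]] for i in range(n))
        a = model.joint_actions.index(acts)
        value = float(model.R[s, a])
        for s2 in range(n_states):
            for o, joint_obs in enumerate(model.joint_observations):
                p = model.P_T[a, s, s2] * model.P_O[a, s2, o]
                if p > 0:
                    hist2 = tuple(histories[i] + (joint_obs[i],) for i in range(n))
                    value += p * V(t + 1, s2, hist2)
        return value

    empty = tuple(() for _ in range(n))
    return sum(b0[s] * V(0, s, empty) for s in range(n_states))
```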

162 Algorithmic Developments Dynamic Programming: DP for POSGs/Dec-POMDPs [Hansen et al. 2004]. Improvements to the exhaustive backup [Amato et al. 2009]. Compression of values (LPC) [Boularias & Chaib-draa 2008]. (Point-based) Memory-Bounded DP [Seuken & Zilberstein 2007a]. Improvements to the PB backup [Seuken & Zilberstein 2007b; Carlin and Zilberstein 2008; Dibangoye et al. 2009; Amato et al. 2009; Wu et al. 2010, etc.]. Heuristic Search: No backtracking: just the most promising path [Emery-Montemerlo et al. 2004; Oliehoek et al. 2008]. Clustering of histories: reduce the number of child nodes [Oliehoek et al. 2009]. Incremental expansion: avoid expanding all child nodes [Spaan et al. 2011; Oliehoek et al. 2013]. MILP [Aras and Dutech 2010].

163 State of The Art To get an impression... Optimal solutions: improvements of MAA* lead to significant increases, but they are problem dependent. (Tables: runtimes in seconds of MILP, LPC and GMAA*-ICE for increasing horizon h on the Dec-Tiger and BroadcastChannel benchmarks; * excluding the heuristic.) Approximate (no quality guarantees): MBDP: linear in the horizon [Seuken & Zilberstein 2007a]. Rollout sampling extension: up to 20 agents [Wu et al. 2010b]. Transfer planning: use smaller problems to solve large (structured) problems with many agents [Oliehoek 2010].

164 Communication

165-167 Single-agent POMDPs (figure: a two-time-slice diagram for times t and t + 1: the state s_t generates observation o_t, the agent takes action a, receives reward r, and the state transitions to s_{t+1})

168-170 Dec-POMDPs (figure: the same two-time-slice diagram, now with individual observations o_i, o_j and individual actions a_i, a_j for the two agents, a shared reward r, and a state transition from s_t to s_{t+1})

171-172 POMDPs vs. Dec-POMDPs Both are frameworks for acting optimally given limited sensing and stochastic environments. Dec-POMDPs: decentralized execution; usually centralized, off-line planning; no common state estimate, no joint belief; optimal policies are based on individual observation histories; NEXP-complete, doubly exponential in the horizon.

173 Adding communication Implicit communication in Dec-POMDPs: agent actions affect the state, which affects other agents' observations. Explicit communication can be added as well; this is equivalent to a Dec-POMDP (Goldman and Zilberstein, 2004). Issues: Information sharing can reduce complexity. What to communicate and to whom? When to coordinate? Semantics are predominantly predefined.

174 Communication and interdependence (figure: models arranged along two axes, communication (none to perfect) and interdependence (low to high): independent POMDPs, Dec-POMDP, Dec-POMDP with comms, auctioned POMDPs, multiagent POMDP)

175-178 Communication Outline 1. Optimizing semantics (Spaan et al., 2006): learning the meaning of messages. 2. Multiagent POMDPs with delayed communication (Spaan et al., 2008; Oliehoek and Spaan, 2012a): planning with stochastically delayed information sharing. 3. Offline communication policies for factored multiagent POMDPs (Messias et al., 2011): planning when communication can be reduced. 4. Event-driven models (Messias et al., 2013): planning for asynchronous execution.

179 Optimizing semantics Learning the meaning of messages

180 Example heaven or hell Task: meet in heaven. The priest knows the location of heaven. (Figure: a grid world with two candidate heaven/hell locations and the priest's location.)

181-183 Proposed model (Spaan et al., 2006) Features: Decentralized: each agent considers only its own local state plus some uncontrollable state features shared by all agents; this avoids computing an (approximate) solution of the centralized planning problem. Communication: the semantics of sending a particular message are part of the optimization problem; communication is an integral part of an agent's reasoning, not an add-on. Environment: the agents inhabit a stochastic environment that is only partially observable to them.

184 Proposed model: discussion Agent i's policy π_i maps a belief over its local state (s_0, s_i) to an action (as in a POMDP). A message sent by an agent i as part of its action is received by agent j at the next timestep as part of its observation. Agent j knows π_i, so a message conveys information about the shared state features. This information is represented in agent i's observation model.

185-187 Communication example (figures: the beliefs of agent 1 and agent 2 over the heaven-left / heaven-right hypotheses as they evolve). Agent 1 has just moved to the priest's location. Agent 1 observes heaven-left and executes south while sending a message. Agent 2 observes no-walls together with the received message, and executes west.

188 POMDP model for a single agent Fixed policies of other agents can be treated as part of the environment. Agent j's policy π_j influences agent i's POMDP model: the observation model through communication, and the reward function through the joint reward. Constructing a POMDP model for i requires the statistics p(s_j) and p(a_j | s_j), which are approximated by simulating {π_i, π_j}.

189 Delayed communication Planning with stochastically delayed information sharing

190 Multiagent POMDPs with delayed communication (figure: the communication/interdependence diagram, with the MPOMDP with delayed communication placed between the Dec-POMDP and the multiagent POMDP)

191 Instantaneous communication (figure: timeline of one decision interval for robots i and j: in state s_t they receive observations o_i, o_j, exchange them, take actions a_i, a_j, and the state transitions to s_{t+1})

192 Instantaneous communication Instantaneous, perfect communication reduces a Dec-POMDP to a POMDP (Pynadath and Tambe, 2002). Observation sharing: each agent knows the joint belief b_t at time t. The value function is piecewise linear and convex (PWLC). But this synchronization step takes time. (Figure: timeline in which the joint belief b is common knowledge before the joint action is taken.)

193 Delayed communication But what if communication is not instantaneous? Delayed communication is useful in several settings: smart protection schemes, multi-robot systems, distributed video surveillance. It also provides an upper bound on the Dec-POMDP value function (Oliehoek et al., 2008).

194 Delayed communication (figure: timeline of a decision interval for agents i and j in which each agent's observation reaches the other agent only in the next decision interval)

195 Fixed delay communication Each agent knows the last commonly known belief b_{t−k}, k > 0. If k = 1, the optimal value function is PWLC (Hsu and Marcus, 1982); it is equal to the Q_BG value function (Oliehoek et al., 2008). Otherwise the value function is not separable (Varaiya and Walrand, 1978): it is not a function over the joint belief space. But what if the delay can vary?

196 Stochastic communication delays (Spaan et al., 2008) Probability that synchronization succeeds within a particular stage i: p_{iTD}(s). Optimal value function: Q_SD = R + p_{0TD} F_{0TD} + p_{1TD} F_{1TD} + p_{2TD} F_{2TD} + ..., where F_{iTD} is the expected future reward given delay i. We assume during planning that the delay is at most 1 step: p_{1D} = p_{1TD} + p_{2TD} + ... = 1 − p_{0TD}, and define an approximate value function as Q̃_SD = R + p_{0TD} F_{0TD} + p_{1D} F_{1TD}. We prove that Q̃_SD is PWLC over the joint belief space.

197 Algorithms Algorithms: Point-based approximate POMDP techniques transfer (Spaan et al., 2008). Exhaustive backups of the one-step-delay value function can be sped up by tree-based pruning (Oliehoek and Spaan, 2012b). For k > 1, we propose an online algorithm similar to Dec-COMM (Roth et al., 2005). Simulation results: the policies consider potential future communication capabilities.

198 Experimental results Meet in corner (figure: a grid with start S and goal G, with a clockwise (CW) and a counter-clockwise (CCW) route). The CW route is cheaper, and with uniform p_{0TD} it is chosen. We set p_{0TD}(s) = 0 for s on the CW route and to 1 everywhere else. Now the agents choose the CCW route, leading to higher payoff.

199 Experimental results Policies computed for a specific value of p_{0TD}, but evaluated under different values of p_{0TD}. (Figure: value Ṽ(b_0) in the large meet-in-corner problem as a function of p_{0TD}, for policies computed with three different values of p_{0TD}.)

200 Offline communication policies Planning when communication can be reduced

201-203 Exploiting local interactions Real-world domains have structure that can be exploited: factored models. Localized interactions can lead to reduced communication.

204 Factored MPOMDPs (figure: the communication/interdependence diagram, with factored MPOMDPs placed near the multiagent POMDP)

205 Relay example State space: S = X_1 × X_2, where X_1 = {l_1, l_2} and X_2 = {r_1, r_2}. Actions: A_1 = A_2 = {shuffle, exchange, sense/noop}. Observations: O_1 = O_2 = {door, wall, idle}. The agents should exchange when both are near the doorway; unsuccessful exchanges are penalized. (Figure: two rooms connected by a doorway, with local states l_1, l_2 and r_1, r_2 and the actions shuffle, exchange, noop.)

206 Factored Multiagent POMDPs In multiagent POMDPs, joint beliefs map to joint actions. It is advantageous to factor the belief state itself (Messias et al., 2011): Agents maintain belief states over locally relevant factors. The joint belief is approximated in factorized form. Multiagent POMDP policies are mapped to individual belief spaces. Query communication is used when other agents' factors are required. Extends ideas of Roth et al. (2007) to the MPOMDP setting.

207 Linear and local supports of the value function (figure: the linear supports of the value function, i.e. the vectors for shuffle, exchange and sense/noop, and their local supports over the agent's local belief in the relay domain)

208 Event-driven models Planning for asynchronous execution

209 Discrete Multiagent Markov Decision Processes Often multiagent decision-theoretic models are discrete: discrete state space S, discrete action space A. Policies are step-based and abstract w.r.t. time: π(s) = {δ_0(s), ..., δ_h(s)}. Communication is required if only local states can be observed. (Figure: two agents acting on a discrete world state with transition model Pr(s′|s, a) and reward R(s, a).)

210 In the Real World... (figure: several copies of the discrete-model diagram from the previous slide, each with its own world state Pr(s′|s, a), agents and reward R(s, a))

211-212 Real-Time Execution Strategies Synchronous execution: all agents select actions at fixed intervals T, 2T, 3T, 4T, while the state evolves s_0, s_1, s_2, s_3 (figure: both agents execute a(0), a(1), a(2), a(3) in lockstep). Event-driven (asynchronous) execution: agents select actions at the event times t_{e1}, t_{e2}, t_{e3} at which the state changes.

213 Semi-Markov Decision Processes SMDP: extension of a discrete MDP with temporal distributions f(s, a, s′): p(t, s′ | s, a) = p(t | s, a, s′) Pr(s′ | s, a) = f(s, a, s′) T(s, a, s′). (Figure: the world state with sojourn-time distribution p(t|s, a, s′), cost c(s, a, s′) and reward R(s, a).)
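
A small sketch of sampling one SMDP transition from p(t, s′|s, a) = f(s, a, s′) T(s, a, s′); the sojourn-time sampler is an assumed, user-provided function (illustrated here with an exponential distribution, which is purely an assumption).

```python
import numpy as np

rng = np.random.default_rng(0)

def smdp_step(s, a, T, sample_sojourn):
    """Sample (t, s') from p(t, s' | s, a) = f(s, a, s') T(s, a, s')."""
    s_next = int(rng.choice(len(T[a, s]), p=T[a, s]))   # embedded Markov chain step
    t = sample_sojourn(s, a, s_next)                    # t ~ f(s, a, s')
    return t, s_next

# Example usage with exponentially distributed sojourn times (an assumption):
# t, s_next = smdp_step(s, a, T, lambda s, a, s2: rng.exponential(1.0))
```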

214 Generalized Semi-Markov Decision Processes Generalized SMDPs (Younes and Simmons, 2004): state transitions are abstracted as events e ∈ E, and transition probabilities also depend on the history of events. We apply a GSMDP-based solution to a team of real robots (Messias et al., 2013), extended to partially observable domains. (Figure: from a state, events lead to successor states s_2, s_3 with probabilities Pr(s_2|s, a), Pr(s_3|s, a) and time distributions p(t|s_2, e_n), p(t|s_3, e_n).)

215 Experimental Setup A cooperative decision-making problem for two robots, implemented both as a GSMDP and as a synchronous MDP with a configurable decision rate (|S| = 26, |A| = 36). Results both with realistic physics-based simulation and with real robots.

216 Robot Results - Video


More information

An Adaptive Clustering Method for Model-free Reinforcement Learning

An Adaptive Clustering Method for Model-free Reinforcement Learning An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at

More information

Efficient Maximization in Solving POMDPs

Efficient Maximization in Solving POMDPs Efficient Maximization in Solving POMDPs Zhengzhu Feng Computer Science Department University of Massachusetts Amherst, MA 01003 fengzz@cs.umass.edu Shlomo Zilberstein Computer Science Department University

More information

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)

More information

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam:

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam: Marks } Assignment 1: Should be out this weekend } All are marked, I m trying to tally them and perhaps add bonus points } Mid-term: Before the last lecture } Mid-term deferred exam: } This Saturday, 9am-10.30am,

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Reinforcement learning Daniel Hennes 4.12.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Reinforcement learning Model based and

More information

Kalman Based Temporal Difference Neural Network for Policy Generation under Uncertainty (KBTDNN)

Kalman Based Temporal Difference Neural Network for Policy Generation under Uncertainty (KBTDNN) Kalman Based Temporal Difference Neural Network for Policy Generation under Uncertainty (KBTDNN) Alp Sardag and H.Levent Akin Bogazici University Department of Computer Engineering 34342 Bebek, Istanbul,

More information

Reinforcement Learning and Control

Reinforcement Learning and Control CS9 Lecture notes Andrew Ng Part XIII Reinforcement Learning and Control We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make

More information

Information Gathering and Reward Exploitation of Subgoals for P

Information Gathering and Reward Exploitation of Subgoals for P Information Gathering and Reward Exploitation of Subgoals for POMDPs Hang Ma and Joelle Pineau McGill University AAAI January 27, 2015 http://www.cs.washington.edu/ai/mobile_robotics/mcl/animations/global-floor.gif

More information

University of Alberta

University of Alberta University of Alberta NEW REPRESENTATIONS AND APPROXIMATIONS FOR SEQUENTIAL DECISION MAKING UNDER UNCERTAINTY by Tao Wang A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment

More information

Reinforcement Learning II

Reinforcement Learning II Reinforcement Learning II Andrea Bonarini Artificial Intelligence and Robotics Lab Department of Electronics and Information Politecnico di Milano E-mail: bonarini@elet.polimi.it URL:http://www.dei.polimi.it/people/bonarini

More information

Producing Efficient Error-bounded Solutions for Transition Independent Decentralized MDPs

Producing Efficient Error-bounded Solutions for Transition Independent Decentralized MDPs Producing Efficient Error-bounded Solutions for Transition Independent Decentralized MDPs Jilles S. Dibangoye INRIA Loria, Campus Scientifique Vandœuvre-lès-Nancy, France jilles.dibangoye@inria.fr Christopher

More information

Artificial Intelligence & Sequential Decision Problems

Artificial Intelligence & Sequential Decision Problems Artificial Intelligence & Sequential Decision Problems (CIV6540 - Machine Learning for Civil Engineers) Professor: James-A. Goulet Département des génies civil, géologique et des mines Chapter 15 Goulet

More information

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Artificial Intelligence Review manuscript No. (will be inserted by the editor) Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Mostafa D. Awheda Howard M. Schwartz Received:

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Formal models of interaction Daniel Hennes 27.11.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Taxonomy of domains Models of

More information

European Workshop on Reinforcement Learning A POMDP Tutorial. Joelle Pineau. McGill University

European Workshop on Reinforcement Learning A POMDP Tutorial. Joelle Pineau. McGill University European Workshop on Reinforcement Learning 2013 A POMDP Tutorial Joelle Pineau McGill University (With many slides & pictures from Mauricio Araya-Lopez and others.) August 2013 Sequential decision-making

More information

, and rewards and transition matrices as shown below:

, and rewards and transition matrices as shown below: CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount

More information

Reinforcement Learning. George Konidaris

Reinforcement Learning. George Konidaris Reinforcement Learning George Konidaris gdk@cs.brown.edu Fall 2017 Machine Learning Subfield of AI concerned with learning from data. Broadly, using: Experience To Improve Performance On Some Task (Tom

More information

Chapter 16 Planning Based on Markov Decision Processes

Chapter 16 Planning Based on Markov Decision Processes Lecture slides for Automated Planning: Theory and Practice Chapter 16 Planning Based on Markov Decision Processes Dana S. Nau University of Maryland 12:48 PM February 29, 2012 1 Motivation c a b Until

More information

Tree-based Solution Methods for Multiagent POMDPs with Delayed Communication

Tree-based Solution Methods for Multiagent POMDPs with Delayed Communication Tree-based Solution Methods for Multiagent POMDPs with Delayed Communication Frans A. Oliehoek MIT CSAIL / Maastricht University Maastricht, The Netherlands frans.oliehoek@maastrichtuniversity.nl Matthijs

More information

Prioritized Sweeping Converges to the Optimal Value Function

Prioritized Sweeping Converges to the Optimal Value Function Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science

More information

Internet Monetization

Internet Monetization Internet Monetization March May, 2013 Discrete time Finite A decision process (MDP) is reward process with decisions. It models an environment in which all states are and time is divided into stages. Definition

More information

Basics of reinforcement learning

Basics of reinforcement learning Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system

More information

Heuristic Search Value Iteration for POMDPs

Heuristic Search Value Iteration for POMDPs 520 SMITH & SIMMONS UAI 2004 Heuristic Search Value Iteration for POMDPs Trey Smith and Reid Simmons Robotics Institute, Carnegie Mellon University {trey,reids}@ri.cmu.edu Abstract We present a novel POMDP

More information

A Decentralized Approach to Multi-agent Planning in the Presence of Constraints and Uncertainty

A Decentralized Approach to Multi-agent Planning in the Presence of Constraints and Uncertainty 2011 IEEE International Conference on Robotics and Automation Shanghai International Conference Center May 9-13, 2011, Shanghai, China A Decentralized Approach to Multi-agent Planning in the Presence of

More information

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 12: Reinforcement Learning Teacher: Gianni A. Di Caro REINFORCEMENT LEARNING Transition Model? State Action Reward model? Agent Goal: Maximize expected sum of future rewards 2 MDP PLANNING

More information

Minimizing Communication Cost in a Distributed Bayesian Network Using a Decentralized MDP

Minimizing Communication Cost in a Distributed Bayesian Network Using a Decentralized MDP Minimizing Communication Cost in a Distributed Bayesian Network Using a Decentralized MDP Jiaying Shen Department of Computer Science University of Massachusetts Amherst, MA 0003-460, USA jyshen@cs.umass.edu

More information

IAS. Dec-POMDPs as Non-Observable MDPs. Frans A. Oliehoek 1 and Christopher Amato 2

IAS. Dec-POMDPs as Non-Observable MDPs. Frans A. Oliehoek 1 and Christopher Amato 2 Universiteit van Amsterdam IAS technical report IAS-UVA-14-01 Dec-POMDPs as Non-Observable MDPs Frans A. Oliehoek 1 and Christopher Amato 2 1 Intelligent Systems Laboratory Amsterdam, University of Amsterdam.

More information

Region-Based Dynamic Programming for Partially Observable Markov Decision Processes

Region-Based Dynamic Programming for Partially Observable Markov Decision Processes Region-Based Dynamic Programming for Partially Observable Markov Decision Processes Zhengzhu Feng Department of Computer Science University of Massachusetts Amherst, MA 01003 fengzz@cs.umass.edu Abstract

More information

Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm

Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 2778 mgl@cs.duke.edu

More information

Decentralized Control of Cooperative Systems: Categorization and Complexity Analysis

Decentralized Control of Cooperative Systems: Categorization and Complexity Analysis Journal of Artificial Intelligence Research 22 (2004) 143-174 Submitted 01/04; published 11/04 Decentralized Control of Cooperative Systems: Categorization and Complexity Analysis Claudia V. Goldman Shlomo

More information

Bayes-Adaptive POMDPs 1

Bayes-Adaptive POMDPs 1 Bayes-Adaptive POMDPs 1 Stéphane Ross, Brahim Chaib-draa and Joelle Pineau SOCS-TR-007.6 School of Computer Science McGill University Montreal, Qc, Canada Department of Computer Science and Software Engineering

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Based on [Kaelbling et al., 1996, Bertsekas, 2000] Bert Kappen Reinforcement learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error

More information

Bayes-Adaptive POMDPs: Toward an Optimal Policy for Learning POMDPs with Parameter Uncertainty

Bayes-Adaptive POMDPs: Toward an Optimal Policy for Learning POMDPs with Parameter Uncertainty Bayes-Adaptive POMDPs: Toward an Optimal Policy for Learning POMDPs with Parameter Uncertainty Stéphane Ross School of Computer Science McGill University, Montreal (Qc), Canada, H3A 2A7 stephane.ross@mail.mcgill.ca

More information

Reinforcement Learning: the basics

Reinforcement Learning: the basics Reinforcement Learning: the basics Olivier Sigaud Université Pierre et Marie Curie, PARIS 6 http://people.isir.upmc.fr/sigaud August 6, 2012 1 / 46 Introduction Action selection/planning Learning by trial-and-error

More information

Reinforcement Learning in Partially Observable Multiagent Settings: Monte Carlo Exploring Policies

Reinforcement Learning in Partially Observable Multiagent Settings: Monte Carlo Exploring Policies Reinforcement earning in Partially Observable Multiagent Settings: Monte Carlo Exploring Policies Presenter: Roi Ceren THINC ab, University of Georgia roi@ceren.net Prashant Doshi THINC ab, University

More information

CMU Lecture 11: Markov Decision Processes II. Teacher: Gianni A. Di Caro

CMU Lecture 11: Markov Decision Processes II. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 11: Markov Decision Processes II Teacher: Gianni A. Di Caro RECAP: DEFINING MDPS Markov decision processes: o Set of states S o Start state s 0 o Set of actions A o Transitions P(s s,a)

More information

Lecture 25: Learning 4. Victor R. Lesser. CMPSCI 683 Fall 2010

Lecture 25: Learning 4. Victor R. Lesser. CMPSCI 683 Fall 2010 Lecture 25: Learning 4 Victor R. Lesser CMPSCI 683 Fall 2010 Final Exam Information Final EXAM on Th 12/16 at 4:00pm in Lederle Grad Res Ctr Rm A301 2 Hours but obviously you can leave early! Open Book

More information

COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning. Hanna Kurniawati

COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning. Hanna Kurniawati COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning Hanna Kurniawati Today } What is machine learning? } Where is it used? } Types of machine learning

More information

Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs

Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs Liam MacDermed College of Computing Georgia Institute of Technology Atlanta, GA 30332 liam@cc.gatech.edu Charles L. Isbell College

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Stuart Russell, UC Berkeley Stuart Russell, UC Berkeley 1 Outline Sequential decision making Dynamic programming algorithms Reinforcement learning algorithms temporal difference

More information

ARTIFICIAL INTELLIGENCE. Reinforcement learning

ARTIFICIAL INTELLIGENCE. Reinforcement learning INFOB2KI 2018-2019 Utrecht University The Netherlands ARTIFICIAL INTELLIGENCE Reinforcement learning Lecturer: Silja Renooij These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

More information

Reinforcement Learning. Yishay Mansour Tel-Aviv University

Reinforcement Learning. Yishay Mansour Tel-Aviv University Reinforcement Learning Yishay Mansour Tel-Aviv University 1 Reinforcement Learning: Course Information Classes: Wednesday Lecture 10-13 Yishay Mansour Recitations:14-15/15-16 Eliya Nachmani Adam Polyak

More information

CS 570: Machine Learning Seminar. Fall 2016

CS 570: Machine Learning Seminar. Fall 2016 CS 570: Machine Learning Seminar Fall 2016 Class Information Class web page: http://web.cecs.pdx.edu/~mm/mlseminar2016-2017/fall2016/ Class mailing list: cs570@cs.pdx.edu My office hours: T,Th, 2-3pm or

More information

SpringerBriefs in Intelligent Systems

SpringerBriefs in Intelligent Systems SpringerBriefs in Intelligent Systems Artificial Intelligence, Multiagent Systems, and Cognitive Robotics Series editors Gerhard Weiss, Maastricht University, Maastricht, The Netherlands Karl Tuyls, University

More information

Dialogue as a Decision Making Process

Dialogue as a Decision Making Process Dialogue as a Decision Making Process Nicholas Roy Challenges of Autonomy in the Real World Wide range of sensors Noisy sensors World dynamics Adaptability Incomplete information Robustness under uncertainty

More information

Final Exam December 12, 2017

Final Exam December 12, 2017 Introduction to Artificial Intelligence CSE 473, Autumn 2017 Dieter Fox Final Exam December 12, 2017 Directions This exam has 7 problems with 111 points shown in the table below, and you have 110 minutes

More information

Sample Bounded Distributed Reinforcement Learning for Decentralized POMDPs

Sample Bounded Distributed Reinforcement Learning for Decentralized POMDPs Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence Sample Bounded Distributed Reinforcement Learning for Decentralized POMDPs Bikramjit Banerjee 1, Jeremy Lyle 2, Landon Kraemer

More information

Temporal Difference Learning & Policy Iteration

Temporal Difference Learning & Policy Iteration Temporal Difference Learning & Policy Iteration Advanced Topics in Reinforcement Learning Seminar WS 15/16 ±0 ±0 +1 by Tobias Joppen 03.11.2015 Fachbereich Informatik Knowledge Engineering Group Prof.

More information

Exploiting Structure to Efficiently Solve Large Scale Partially Observable Markov Decision Processes. Pascal Poupart

Exploiting Structure to Efficiently Solve Large Scale Partially Observable Markov Decision Processes. Pascal Poupart Exploiting Structure to Efficiently Solve Large Scale Partially Observable Markov Decision Processes by Pascal Poupart A thesis submitted in conformity with the requirements for the degree of Doctor of

More information

Communication-Based Decomposition Mechanisms for Decentralized MDPs

Communication-Based Decomposition Mechanisms for Decentralized MDPs Journal of Artificial Intelligence Research 32 (2008) 169-202 Submitted 10/07; published 05/08 Communication-Based Decomposition Mechanisms for Decentralized MDPs Claudia V. Goldman Samsung Telecom Research

More information

Sequential Decision Problems

Sequential Decision Problems Sequential Decision Problems Michael A. Goodrich November 10, 2006 If I make changes to these notes after they are posted and if these changes are important (beyond cosmetic), the changes will highlighted

More information

An Introduction to Markov Decision Processes. MDP Tutorial - 1

An Introduction to Markov Decision Processes. MDP Tutorial - 1 An Introduction to Markov Decision Processes Bob Givan Purdue University Ron Parr Duke University MDP Tutorial - 1 Outline Markov Decision Processes defined (Bob) Objective functions Policies Finding Optimal

More information

A Concise Introduction to Decentralized POMDPs

A Concise Introduction to Decentralized POMDPs Frans A. Oliehoek & Christopher Amato A Concise Introduction to Decentralized POMDPs Author version July 6, 2015 Springer This is the authors pre-print version that does not include corrections that were

More information

Symbolic Perseus: a Generic POMDP Algorithm with Application to Dynamic Pricing with Demand Learning

Symbolic Perseus: a Generic POMDP Algorithm with Application to Dynamic Pricing with Demand Learning Symbolic Perseus: a Generic POMDP Algorithm with Application to Dynamic Pricing with Demand Learning Pascal Poupart (University of Waterloo) INFORMS 2009 1 Outline Dynamic Pricing as a POMDP Symbolic Perseus

More information

Outline. CSE 573: Artificial Intelligence Autumn Agent. Partial Observability. Markov Decision Process (MDP) 10/31/2012

Outline. CSE 573: Artificial Intelligence Autumn Agent. Partial Observability. Markov Decision Process (MDP) 10/31/2012 CSE 573: Artificial Intelligence Autumn 2012 Reasoning about Uncertainty & Hidden Markov Models Daniel Weld Many slides adapted from Dan Klein, Stuart Russell, Andrew Moore & Luke Zettlemoyer 1 Outline

More information

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Mostafa D. Awheda Department of Systems and Computer Engineering Carleton University Ottawa, Canada KS 5B6 Email: mawheda@sce.carleton.ca

More information

Planning Under Uncertainty II

Planning Under Uncertainty II Planning Under Uncertainty II Intelligent Robotics 2014/15 Bruno Lacerda Announcement No class next Monday - 17/11/2014 2 Previous Lecture Approach to cope with uncertainty on outcome of actions Markov

More information

Efficient Learning in Linearly Solvable MDP Models

Efficient Learning in Linearly Solvable MDP Models Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence Efficient Learning in Linearly Solvable MDP Models Ang Li Department of Computer Science, University of Minnesota

More information

CS788 Dialogue Management Systems Lecture #2: Markov Decision Processes

CS788 Dialogue Management Systems Lecture #2: Markov Decision Processes CS788 Dialogue Management Systems Lecture #2: Markov Decision Processes Kee-Eung Kim KAIST EECS Department Computer Science Division Markov Decision Processes (MDPs) A popular model for sequential decision

More information

Reinforcement Learning. Introduction

Reinforcement Learning. Introduction Reinforcement Learning Introduction Reinforcement Learning Agent interacts and learns from a stochastic environment Science of sequential decision making Many faces of reinforcement learning Optimal control

More information

QUICR-learning for Multi-Agent Coordination

QUICR-learning for Multi-Agent Coordination QUICR-learning for Multi-Agent Coordination Adrian K. Agogino UCSC, NASA Ames Research Center Mailstop 269-3 Moffett Field, CA 94035 adrian@email.arc.nasa.gov Kagan Tumer NASA Ames Research Center Mailstop

More information

Lecture 3: Markov Decision Processes

Lecture 3: Markov Decision Processes Lecture 3: Markov Decision Processes Joseph Modayil 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs Markov Processes Introduction Introduction to MDPs Markov

More information

Final Exam December 12, 2017

Final Exam December 12, 2017 Introduction to Artificial Intelligence CSE 473, Autumn 2017 Dieter Fox Final Exam December 12, 2017 Directions This exam has 7 problems with 111 points shown in the table below, and you have 110 minutes

More information

Partially Observable Markov Decision Processes (POMDPs)

Partially Observable Markov Decision Processes (POMDPs) Partially Observable Markov Decision Processes (POMDPs) Sachin Patil Guest Lecture: CS287 Advanced Robotics Slides adapted from Pieter Abbeel, Alex Lee Outline Introduction to POMDPs Locally Optimal Solutions

More information

Temporal Difference. Learning KENNETH TRAN. Principal Research Engineer, MSR AI

Temporal Difference. Learning KENNETH TRAN. Principal Research Engineer, MSR AI Temporal Difference Learning KENNETH TRAN Principal Research Engineer, MSR AI Temporal Difference Learning Policy Evaluation Intro to model-free learning Monte Carlo Learning Temporal Difference Learning

More information

Markov Decision Processes Chapter 17. Mausam

Markov Decision Processes Chapter 17. Mausam Markov Decision Processes Chapter 17 Mausam Planning Agent Static vs. Dynamic Fully vs. Partially Observable Environment What action next? Deterministic vs. Stochastic Perfect vs. Noisy Instantaneous vs.

More information

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes Today s s Lecture Lecture 20: Learning -4 Review of Neural Networks Markov-Decision Processes Victor Lesser CMPSCI 683 Fall 2004 Reinforcement learning 2 Back-propagation Applicability of Neural Networks

More information

Learning in Zero-Sum Team Markov Games using Factored Value Functions

Learning in Zero-Sum Team Markov Games using Factored Value Functions Learning in Zero-Sum Team Markov Games using Factored Value Functions Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 27708 mgl@cs.duke.edu Ronald Parr Department of Computer

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Markov decision process & Dynamic programming Evaluative feedback, value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value

More information

Reinforcement Learning and Deep Reinforcement Learning

Reinforcement Learning and Deep Reinforcement Learning Reinforcement Learning and Deep Reinforcement Learning Ashis Kumer Biswas, Ph.D. ashis.biswas@ucdenver.edu Deep Learning November 5, 2018 1 / 64 Outlines 1 Principles of Reinforcement Learning 2 The Q

More information

RL 14: Simplifications of POMDPs

RL 14: Simplifications of POMDPs RL 14: Simplifications of POMDPs Michael Herrmann University of Edinburgh, School of Informatics 04/03/2016 POMDPs: Points to remember Belief states are probability distributions over states Even if computationally

More information

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation CS234: RL Emma Brunskill Winter 2018 Material builds on structure from David SIlver s Lecture 4: Model-Free

More information

POMDPs and Policy Gradients

POMDPs and Policy Gradients POMDPs and Policy Gradients MLSS 2006, Canberra Douglas Aberdeen Canberra Node, RSISE Building Australian National University 15th February 2006 Outline 1 Introduction What is Reinforcement Learning? Types

More information

2534 Lecture 4: Sequential Decisions and Markov Decision Processes

2534 Lecture 4: Sequential Decisions and Markov Decision Processes 2534 Lecture 4: Sequential Decisions and Markov Decision Processes Briefly: preference elicitation (last week s readings) Utility Elicitation as a Classification Problem. Chajewska, U., L. Getoor, J. Norman,Y.

More information

Planning in Markov Decision Processes

Planning in Markov Decision Processes Carnegie Mellon School of Computer Science Deep Reinforcement Learning and Control Planning in Markov Decision Processes Lecture 3, CMU 10703 Katerina Fragkiadaki Markov Decision Process (MDP) A Markov

More information

Decision Theory: Markov Decision Processes

Decision Theory: Markov Decision Processes Decision Theory: Markov Decision Processes CPSC 322 Lecture 33 March 31, 2006 Textbook 12.5 Decision Theory: Markov Decision Processes CPSC 322 Lecture 33, Slide 1 Lecture Overview Recap Rewards and Policies

More information