Multi-Objective Decision Making

1 Multi-Objective Decision Making Shimon Whiteson & Diederik M. Roijers Department of Computer Science University of Oxford July 25, 2016 Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

2 Schedule 14:30-15:10: Motivation & Concepts (Shimon) 15:10-15:20: Short Break 15:20-16:00: Motivation & Concepts cont'd (Shimon) 16:00-16:30: Coffee Break 16:30-17:10: Methods & Applications (Diederik) 17:10-17:20: Short Break 17:20-18:00: Methods & Applications cont'd (Diederik) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

3 Note Get the latest version of the slides at: This tutorial is based on our survey article: Diederik Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A Survey of Multi-Objective Sequential Decision-Making. Journal of Artificial Intelligence Research, 48:67–113, 2013, and Diederik's dissertation: Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

4 Part 1: Motivation & Concepts Multi-Objective Motivation MDPs & MOMDPs Problem Taxonomy Solution Concepts Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

5 Medical Treatment Chance of being cured, having side effects, or dying Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

6 Traffic Coordination Latency, throughput, fairness, environmental impact, etc. Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

7 Mining Commodities Gold collected, silver collected (figure: village and mine) [Roijers et al. 2013, 2014] Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

8 Grid World Getting the treasure, minimising fuel costs Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

9 Do We Need Multi-Objective Models? Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

10 Do We Need Multi-Objective Models? Sutton's Reward Hypothesis: All of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward). Source: Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

11 Do We Need Multi-Objective Models? Sutton's Reward Hypothesis: All of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward). Source: V : Π → R, V^π = E_π[ Σ_t r_t ], π* = arg max_π V^π Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

12 Why Multi-Objective Decision Making? The weak argument: real-world problems are multi-objective! V : Π → R^n Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

13 Why Multi-Objective Decision Making? The weak argument: real-world problems are multi-objective! V : Π → R^n Objection: why not just scalarize? Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

14 Why Multi-Objective Decision Making? The weak argument: real-world problems are multi-objective! V : Π → R^n Objection: why not just scalarize? Scalarization function projects multi-objective value to a scalar: V_w^π = f(V^π, w) Linear case: V_w^π = Σ_{i=1}^n w_i V_i^π = w · V^π A priori prioritization of the objectives Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

15 Why Multi-Objective Decision Making? The weak argument: real-world problems are multi-objective! V : Π → R^n Objection: why not just scalarize? Scalarization function projects multi-objective value to a scalar: V_w^π = f(V^π, w) Linear case: V_w^π = Σ_{i=1}^n w_i V_i^π = w · V^π A priori prioritization of the objectives The weak argument is necessary but not sufficient Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

16 Why Multi-Objective Decision Making? The strong argument: a priori scalarization is sometimes impossible, infeasible, or undesirable Instead produce the coverage set of undominated solutions Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

17 Why Multi-Objective Decision Making? The strong argument: a priori scalarization is sometimes impossible, infeasible, or undesirable Instead produce the coverage set of undominated solutions Unknown-weights scenario Weights known in execution phase but not in planning phase Example: mining commodities [Roijers et al. 2013] Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

18 Why Multi-Objective Decision Making? Decision-support scenario Quantifying priorities is infeasible Choosing between options is easier Example: medical treatment Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

19 Why Multi-Objective Decision Making? Decision-support scenario Quantifying priorities is infeasible Choosing between options is easier Example: medical treatment Known-weights scenario: scalarization yields intractable problem Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

20 Summary of Motivation Multi-objective methods are useful because many problems are naturally characterized by multiple objectives and cannot be easily scalarized a priori. Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

21 Summary of Motivation Multi-objective methods are useful because many problems are naturally characterized by multiple objectives and cannot be easily scalarized a priori. The burden of proof rests with the a priori scalarization, not with the multi-objective modeling. Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

22 Part 1: Motivation & Concepts Multi-Objective Motivation MDPs & MOMDPs Problem Taxonomy Solution Concepts Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

23 Markov Decision Process (MDP) A single-objective MDP is a tuple ⟨S, A, T, R, µ, γ⟩ where: S is a finite set of states A is a finite set of actions T : S × A × S → [0, 1] is a transition function R : S × A × S → R is a reward function µ : S → [0, 1] is a probability distribution over initial states γ ∈ [0, 1) is a discount factor (figure from Poole & Mackworth, Artificial Intelligence: Foundations of Computational Agents, 2010) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

24 Returns & Policies Goal: maximize expected return, which is typically additive: R_t = Σ_{k=0}^∞ γ^k r_{t+k+1} A stationary policy conditions only on the current state: π : S × A → [0, 1] A deterministic stationary policy maps states directly to actions: π : S → A Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

25 Value Functions in MDPs A state-independent value function V^π specifies the expected return when following π from the initial state: V^π = E[R_0 | π] A state value function of a policy π: V^π(s) = E[R_t | π, s_t = s] The Bellman equation restates this expectation recursively for stationary policies: V^π(s) = Σ_a π(s, a) Σ_{s'} T(s, a, s')[R(s, a, s') + γ V^π(s')] Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

26 Optimality in MDPs Theorem For any additive infinite-horizon single-objective MDP, there exists a stationary optimal policy [Howard 1960] All optimal policies share the same optimal value function: V*(s) = max_π V^π(s) = max_a Σ_{s'} T(s, a, s')[R(s, a, s') + γ V*(s')] Extract the optimal policy using local action selection: π*(s) = arg max_a Σ_{s'} T(s, a, s')[R(s, a, s') + γ V*(s')] Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

27 Multi-Objective MDP (MOMDP) Vector-valued reward and value: R : S × A × S → R^n V^π = E[ Σ_{k=0}^∞ γ^k r_{k+1} | π ] V^π(s) = E[ Σ_{k=0}^∞ γ^k r_{t+k+1} | π, s_t = s ] V^π(s) imposes only a partial ordering, e.g., V_i^π(s) > V_i^{π'}(s) but V_j^π(s) < V_j^{π'}(s) Definition of optimality no longer clear Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

28 Part 1: Motivation & Concepts Multi-Objective Motivation MDPs & MOMDPs Problem Taxonomy Solution Concepts Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

29 Axiomatic vs. Utility-Based Approach Axiomatic approach: define optimal solution set to be Pareto front Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

30 Axiomatic vs. Utility-Based Approach Axiomatic approach: define optimal solution set to be Pareto front Utility-based approach: Execution phase: select one policy maximizing scalar utility V_w^π, where w may be hidden or implicit Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

31 Axiomatic vs. Utility-Based Approach Axiomatic approach: define optimal solution set to be Pareto front Utility-based approach: Execution phase: select one policy maximizing scalar utility V_w^π, where w may be hidden or implicit Planning phase: find a set of policies containing an optimal solution for each possible w; if w unknown, size of set generally > 1 Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

32 Axiomatic vs. Utility-Based Approach Axiomatic approach: define optimal solution set to be Pareto front Utility-based approach: Execution phase: select one policy maximizing scalar utility V_w^π, where w may be hidden or implicit Planning phase: find a set of policies containing an optimal solution for each possible w; if w unknown, size of set generally > 1 Deduce optimal solution set from three factors: 1 Multi-objective scenario 2 Properties of scalarization function 3 Allowable policies Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

33 Three Factors 1 Multi-objective scenario Known weights → single policy Unknown weights or decision support → multiple policies 2 Properties of scalarization function Linear Monotonically increasing 3 Allowable policies Deterministic Stochastic Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

34 Problem Taxonomy

                                         linear scalarization                    monotonically increasing scalarization
single policy          deterministic     one deterministic stationary policy     one deterministic non-stationary policy
(known weights)        stochastic        one deterministic stationary policy     one mixture policy of two or more
                                                                                 deterministic stationary policies
multiple policies      deterministic     convex coverage set of deterministic    Pareto coverage set of deterministic
(unknown weights or                      stationary policies                     non-stationary policies
decision support)      stochastic        convex coverage set of deterministic    convex coverage set of deterministic
                                         stationary policies                     stationary policies

Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

35 Part 1: Motivation & Concepts Multi-Objective Motivation MDPs & MOMDPs Problem Taxonomy Solution Concepts Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

36 Problem Taxonomy (table as on slide 34) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

37 Linear Scalarization Functions Computes inner product of w and V^π: V_w^π = Σ_{i=1}^n w_i V_i^π = w · V^π, w ∈ R^n w_i quantifies importance of the i-th objective Simple and intuitive, e.g., when utility translates to money: revenue = #cans · ppc + #bottles · ppb Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

38 Linear Scalarization Functions Computes inner product of w and V^π: V_w^π = Σ_{i=1}^n w_i V_i^π = w · V^π, w ∈ R^n w_i quantifies importance of the i-th objective Simple and intuitive, e.g., when utility translates to money: revenue = #cans · ppc + #bottles · ppb w typically constrained to be a convex combination: ∀i w_i ≥ 0, Σ_i w_i = 1 utility = #cans · ppc/(ppc + ppb) + #bottles · ppb/(ppc + ppb) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

39 Linear Scalarization & Single Policy No special methods required: just apply f to each reward vector Inner product distributes over addition, yielding a normal MDP: V_w^π = w · V^π = w · E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ] = E[ Σ_{k=0}^∞ γ^k (w · r_{t+k+1}) ] Apply standard methods to an MDP with: R_w(s, a, s') = w · R(s, a, s'), yielding a single deterministic stationary policy Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
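To make the reduction concrete, here is a minimal sketch (illustrative names, assuming a small tabular MOMDP stored as nested Python dicts) of scalarizing the vector-valued reward with a known w so that any standard MDP solver applies:

```python
import numpy as np

def scalarize_rewards(R, w):
    """Turn a vector-valued reward table R[s][a][s'] -> vector
    into a scalar reward table R_w[s][a][s'] = w . R(s, a, s')."""
    return {
        s: {a: {s2: float(np.dot(w, r)) for s2, r in succ.items()}
            for a, succ in acts.items()}
        for s, acts in R.items()
    }

# Example: two objectives (cans, bottles), prices as weights.
R = {'s0': {'collect': {'s0': np.array([2.0, 1.0])}}}
w = np.array([0.25, 0.10])             # price per can, price per bottle
print(scalarize_rewards(R, w))         # {'s0': {'collect': {'s0': 0.6}}}
```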

40 Problem Taxonomy (table as on slide 34) Example: collecting bottles and cans Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

41 Problem Taxonomy (table as on slide 34) Example: collecting bottles and cans Note: only cell in the taxonomy that does not require multi-objective methods Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

42 Problem Taxonomy (table as on slide 34) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

43 Multiple Policies Unknown weights or decision support → multiple policies During planning w is unknown Size of solution set is generally > 1 Set should not contain policies that are suboptimal for all w Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

44 Undominated & Coverage Sets Definition The undominated set U(Π) is the subset of all possible policies Π for which there exists a w for which the scalarized value is maximal: U(Π) = {π : π ∈ Π ∧ ∃w ∀(π' ∈ Π) V_w^π ≥ V_w^{π'}} Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

45 Undominated & Coverage Sets Definition The undominated set U(Π) is the subset of all possible policies Π for which there exists a w for which the scalarized value is maximal: U(Π) = {π : π ∈ Π ∧ ∃w ∀(π' ∈ Π) V_w^π ≥ V_w^{π'}} Definition A coverage set CS(Π) is a subset of U(Π) that, for every w, contains a policy with maximal scalarized value, i.e., CS(Π) ⊆ U(Π) ∧ (∀w)(∃π)( π ∈ CS(Π) ∧ ∀(π' ∈ Π) V_w^π ≥ V_w^{π'} ) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

46 Example (table: scalarized values V_w^π of four policies π_1, ..., π_4 for the two weights w = true and w = false) One binary weight feature: only two possible weights Weights are not objectives but two possible scalarizations Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

47 Example (table: scalarized values V_w^π of four policies π_1, ..., π_4 for the two weights w = true and w = false) One binary weight feature: only two possible weights Weights are not objectives but two possible scalarizations U(Π) = {π_1, π_2, π_3} but CS(Π) = {π_1, π_2} or {π_2, π_3} Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

48 Execution Phase Single policy selected from CS(Π) and executed Unknown weights: weights revealed and maximizing policy selected: π* = arg max_{π ∈ CS(Π)} V_w^π Decision support: CS(Π) is manually inspected by the user Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
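A small illustrative sketch (hypothetical names, assuming linear scalarization and a coverage set stored as (policy, value-vector) pairs) of the unknown-weights selection step once w is revealed:

```python
import numpy as np

def select_policy(coverage_set, w):
    """coverage_set: list of (policy_id, value_vector) pairs.
    Returns the entry maximizing the linearly scalarized value w . V."""
    return max(coverage_set, key=lambda entry: float(np.dot(w, entry[1])))

ccs = [('pi_gold', np.array([3.0, 0.0])),
       ('pi_both', np.array([1.0, 1.0])),
       ('pi_silver', np.array([0.0, 3.0]))]
print(select_policy(ccs, np.array([0.8, 0.2])))   # ('pi_gold', array([3., 0.]))
```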

49 Linear Scalarization & Multiple Policies Definition The convex hull CH(Π) is the subset of Π for which there exists a w that maximizes the linearly scalarized value: CH(Π) = {π : π ∈ Π ∧ ∃w ∀(π' ∈ Π) w · V^π ≥ w · V^{π'}} Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

50 Linear Scalarization & Multiple Policies Definition The convex hull CH(Π) is the subset of Π for which there exists a w that maximizes the linearly scalarized value: CH(Π) = {π : π ∈ Π ∧ ∃w ∀(π' ∈ Π) w · V^π ≥ w · V^{π'}} Definition The convex coverage set CCS(Π) is a subset of CH(Π) that, for every w, contains a policy whose linearly scalarized value is maximal, i.e., CCS(Π) ⊆ CH(Π) ∧ (∀w)(∃π)( π ∈ CCS(Π) ∧ ∀(π' ∈ Π) w · V^π ≥ w · V^{π'} ) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

51 Visualization (figures: Objective Space and Weight Space) V_w = w_0 V_0 + w_1 V_1, with w_0 = 1 - w_1 Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

52 Problem Taxonomy (table as on slide 34) Example: mining gold and silver Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

53 Problem Taxonomy (table as on slide 34) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

54 Monotonically Increasing Scalarization Functions Mining example: V^{π_1} = (3, 0), V^{π_2} = (0, 3), V^{π_3} = (1, 1) Choosing V^{π_3} implies a nonlinear scalarization function Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

55 Monotonically Increasing Scalarization Functions Definition A scalarization function is strictly monotonically increasing if changing a policy such that its value increases in one or more objectives, without decreasing in any other objectives, also increases the scalarized value: ( ∀i V_i^{π'} ≥ V_i^π ∧ ∃i V_i^{π'} > V_i^π ) ⟹ ( ∀w V_w^{π'} > V_w^π ) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

56 Monotonically Increasing Scalarization Functions Definition A scalarization function is strictly monotonically increasing if changing a policy such that its value increases in one or more objectives, without decreasing in any other objectives, also increases the scalarized value: ( ∀i V_i^{π'} ≥ V_i^π ∧ ∃i V_i^{π'} > V_i^π ) ⟹ ( ∀w V_w^{π'} > V_w^π ) Definition A policy π Pareto-dominates another policy π' when its value is at least as high in all objectives and strictly higher in at least one objective: V^π ≻_P V^{π'} ⟺ ∀i V_i^π ≥ V_i^{π'} ∧ ∃i V_i^π > V_i^{π'} A policy is Pareto optimal if no policy Pareto-dominates it. Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
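A short sketch of the Pareto-dominance test as defined above (NumPy-based; the function name is illustrative):

```python
import numpy as np

def pareto_dominates(v, u):
    """True iff value vector v Pareto-dominates u:
    at least as good in every objective and strictly better in one."""
    v, u = np.asarray(v, dtype=float), np.asarray(u, dtype=float)
    return bool(np.all(v >= u) and np.any(v > u))

print(pareto_dominates([3, 1], [2, 1]))   # True
print(pareto_dominates([3, 0], [0, 3]))   # False: incomparable
```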

57 Nonlinear Scalarization Can Destroy Additivity Nonlinear scalarization and expectation do not commute: V_w^π = f(V^π, w) = f( E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ], w ) ≠ E[ Σ_{k=0}^∞ γ^k f(r_{t+k+1}, w) ] Bellman-based methods not applicable Local action selection no longer yields an optimal policy: π*(s) ≠ arg max_a Σ_{s'} T(s, a, s')[R(s, a, s') + γ V*(s')] Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

58 Deterministic vs. Stochastic Policies Stochastic policies are fine in most settings Sometimes inappropriate, e.g., medical treatment In MDPs, requiring deterministic policies is not restrictive Optimal value attainable with a deterministic stationary policy: π*(s) = arg max_a Σ_{s'} T(s, a, s')[R(s, a, s') + γ V*(s')] Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

59 Deterministic vs. Stochastic Policies Stochastic policies are fine in most settings Sometimes inappropriate, e.g., medical treatment In MDPs, requiring deterministic policies is not restrictive Optimal value attainable with a deterministic stationary policy: π*(s) = arg max_a Σ_{s'} T(s, a, s')[R(s, a, s') + γ V*(s')] Similar for MOMDPs with linear scalarization MOMDPs with nonlinear scalarization: Stochastic policies may be preferable if allowed Nonstationary policies may be preferable otherwise Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

60 White's Example (1982) 3 actions: R(a_1) = (3, 0), R(a_2) = (0, 3), R(a_3) = (1, 1) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

61 White's Example (1982) 3 actions: R(a_1) = (3, 0), R(a_2) = (0, 3), R(a_3) = (1, 1) 3 stationary policies, all Pareto-optimal: V^{π_1} = ( 3/(1-γ), 0 ), V^{π_2} = ( 0, 3/(1-γ) ), V^{π_3} = ( 1/(1-γ), 1/(1-γ) ) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

62 White's Example (1982) 3 actions: R(a_1) = (3, 0), R(a_2) = (0, 3), R(a_3) = (1, 1) 3 stationary policies, all Pareto-optimal: V^{π_1} = ( 3/(1-γ), 0 ), V^{π_2} = ( 0, 3/(1-γ) ), V^{π_3} = ( 1/(1-γ), 1/(1-γ) ) π_ns alternates between a_1 and a_2, starting with a_1: V^{π_ns} = ( 3/(1-γ²), 3γ/(1-γ²) ) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

63 White's Example (1982) 3 actions: R(a_1) = (3, 0), R(a_2) = (0, 3), R(a_3) = (1, 1) 3 stationary policies, all Pareto-optimal: V^{π_1} = ( 3/(1-γ), 0 ), V^{π_2} = ( 0, 3/(1-γ) ), V^{π_3} = ( 1/(1-γ), 1/(1-γ) ) π_ns alternates between a_1 and a_2, starting with a_1: V^{π_ns} = ( 3/(1-γ²), 3γ/(1-γ²) ) Thus V^{π_ns} ≻_P V^{π_3} when γ ≥ 0.5, e.g., γ = 0.5 and f(V^π) = V_1^π · V_2^π: f(V^{π_1}) = f(V^{π_2}) = 0, f(V^{π_3}) = 4, f(V^{π_ns}) = 8 (figures: value vectors in objective space for several values of γ) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
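A quick numeric check of White's example under the values above (γ = 0.5 and the nonlinear scalarization f(V) = V_1 · V_2):

```python
# Values from White's example for a given discount factor.
def values(gamma):
    v1 = (3 / (1 - gamma), 0.0)                              # always a1
    v2 = (0.0, 3 / (1 - gamma))                              # always a2
    v3 = (1 / (1 - gamma), 1 / (1 - gamma))                  # always a3
    vns = (3 / (1 - gamma**2), 3 * gamma / (1 - gamma**2))   # alternate a1, a2
    return v1, v2, v3, vns

v1, v2, v3, vns = values(0.5)
f = lambda v: v[0] * v[1]           # the nonlinear scalarization used above
print(v3, vns)                      # (2.0, 2.0) (4.0, 2.0)
print(f(v3), f(vns))                # 4.0 8.0 -> pi_ns Pareto-dominates pi_3
```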

64 Problem Taxonomy (table as on slide 34) Example: radiation vs. chemotherapy Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

65 Problem Taxonomy (table as on slide 34) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

66 Mixture Policies A mixture policy π_m selects the i-th policy from a set of N policies with probability p_i, where Σ_i p_i = 1 Values are a convex combination of the values of the constituent policies In White's example, replace π_ns by π_m: V^{π_m} = p_1 V^{π_1} + (1 - p_1) V^{π_2} = ( 3p_1/(1-γ), 3(1-p_1)/(1-γ) ) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
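A minimal sketch (illustrative names) of computing a mixture policy's value as a convex combination of its constituents, reproducing the expression above for γ = 0.5 and p_1 = 0.5:

```python
import numpy as np

def mixture_value(values, probs):
    """Value of a mixture policy: convex combination of the values
    of its constituent (deterministic stationary) policies."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    assert np.isclose(probs.sum(), 1.0) and np.all(probs >= 0)
    return probs @ values

gamma, p1 = 0.5, 0.5
v1, v2 = [3 / (1 - gamma), 0.0], [0.0, 3 / (1 - gamma)]
print(mixture_value([v1, v2], [p1, 1 - p1]))   # [3. 3.]
```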

67 Problem Taxonomy (table as on slide 34) Example: studying vs. networking Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

68 Problem Taxonomy (table as on slide 34) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

69 Pareto Sets Definition The Pareto front is the set of all policies that are not Pareto-dominated: PF(Π) = {π : π ∈ Π ∧ ¬∃(π' ∈ Π) V^{π'} ≻_P V^π} Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

70 Pareto Sets Definition The Pareto front is the set of all policies that are not Pareto-dominated: PF(Π) = {π : π ∈ Π ∧ ¬∃(π' ∈ Π) V^{π'} ≻_P V^π} Definition A Pareto coverage set is a subset of PF(Π) such that, for every π' ∈ Π, it contains a policy that either dominates π' or has equal value to π': PCS(Π) ⊆ PF(Π) ∧ ∀(π' ∈ Π)(∃π)( π ∈ PCS(Π) ∧ (V^π ≻_P V^{π'} ∨ V^π = V^{π'}) ) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

71 Visualization (figures: Objective Space and Weight Space) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

72 Visualization (figures: Objective Space and Weight Space) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

73 Problem Taxonomy (table as on slide 34) Example: radiation vs. chemotherapy (again) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

74 Problem Taxonomy (table as on slide 34) Example: radiation vs. chemotherapy (again) Note: the only setting that requires a Pareto front! Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

75 Problem Taxonomy (table as on slide 34) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

76 Mixture Policies A CCS(Π_DS) is also a CCS(Π) but not necessarily a PCS(Π) But a PCS(Π) can be made by mixing policies in a CCS(Π_DS) (figure: objective space) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

77 Problem Taxonomy (table as on slide 34) Example: studying vs. networking (again) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

78 Part 2: Methods and Applications Convex Coverage Set Planning Methods Inner Loop: Convex Hull Value Iteration Outer Loop: Optimistic Linear Support Pareto Coverage Set Planning Methods Inner loop (non-stationary): Pareto-Q Outer loop issues Beyond the Taxonomy Applications Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

79 Taxonomy (table as on slide 34) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

80 Taxonomy (table as on slide 34) Known transition and reward functions → planning Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

81 Taxonomy (table as on slide 34) Known transition and reward functions → planning Unknown transition and reward functions → learning Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

82 Background: Value Iteration Initial value estimate V_0(s) Apply Bellman backups until convergence: V_{k+1}(s) ← max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ] Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

83 Background: Value Iteration Initial value estimate V_0(s) Apply Bellman backups until convergence: V_{k+1}(s) ← max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ] Can also be written: V_{k+1}(s) ← max_a Q_{k+1}(s, a), Q_{k+1}(s, a) ← Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ] Optimal policy is easy to retrieve from the Q-table Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
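A compact sketch of tabular value iteration as written above (the dict-based MDP encoding is an assumption for illustration, not the tutorial's code):

```python
def value_iteration(S, A, T, R, gamma, tol=1e-8):
    """Tabular VI. T[s][a] is a list of (prob, s2) pairs; R[s][a][s2] is a scalar."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            # Bellman backup: max over actions of expected reward plus discounted value.
            Q = {a: sum(p * (R[s][a][s2] + gamma * V[s2]) for p, s2 in T[s][a])
                 for a in A}
            v_new = max(Q.values())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

# One-state, one-action MDP with reward 1: V converges to 1 / (1 - gamma) = 2.
print(value_iteration(['s'], ['a'], {'s': {'a': [(1.0, 's')]}},
                      {'s': {'a': {'s': 1.0}}}, gamma=0.5))
```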

84 Taxonomy (table as on slide 34) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

85 Scalarize MOMDP + Value Iteration For known w: V_w^π = w · E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ] = E[ Σ_{k=0}^∞ γ^k (w · r_{t+k+1}) ] Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

86 Scalarize MOMDP + Value Iteration For known w: V_w^π = w · E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ] = E[ Σ_{k=0}^∞ γ^k (w · r_{t+k+1}) ] Scalarize the reward function of the MOMDP: R_w = w · R Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

87 Scalarize MOMDP + Value Iteration For known w: V_w^π = w · E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ] = E[ Σ_{k=0}^∞ γ^k (w · r_{t+k+1}) ] Scalarize the reward function of the MOMDP: R_w = w · R Apply standard VI Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

88 Scalarize MOMDP + Value Iteration For known w: V_w^π = w · E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ] = E[ Σ_{k=0}^∞ γ^k (w · r_{t+k+1}) ] Scalarize the reward function of the MOMDP: R_w = w · R Apply standard VI Does not return multi-objective value Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

89 Scalarized Value Iteration Adapt the Bellman backup: V_{k+1}(s) ← Q_{k+1}(s, a*) with a* = arg max_a w · Q_{k+1}(s, a), Q_{k+1}(s, a) ← Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ] Returns multi-objective value. Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
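A sketch of the scalarized backup above, keeping vector-valued estimates but selecting the greedy action by w · Q (illustrative encoding; a fixed number of sweeps stands in for a proper convergence test):

```python
import numpy as np

def scalarized_vi(S, A, T, R, w, gamma, iters=1000):
    """Scalarized VI: the backup picks the action maximizing w . Q(s, a)
    but stores the full multi-objective value vector."""
    V = {s: np.zeros(len(w)) for s in S}
    for _ in range(iters):
        V_new = {}
        for s in S:
            Q = {a: sum(p * (np.asarray(R[s][a][s2]) + gamma * V[s2])
                        for p, s2 in T[s][a])
                 for a in A}
            best = max(A, key=lambda a: float(np.dot(w, Q[a])))
            V_new[s] = Q[best]
        V = V_new
    return V

# Two-action, one-state MOMDP; with w = (0.75, 0.25) the backup always picks a1.
T = {'s': {'a1': [(1.0, 's')], 'a2': [(1.0, 's')]}}
R = {'s': {'a1': {'s': (2.0, 0.0)}, 'a2': {'s': (0.0, 2.0)}}}
print(scalarized_vi(['s'], ['a1', 'a2'], T, R,
                    w=np.array([0.75, 0.25]), gamma=0.5))   # {'s': array([4., 0.])}
```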

90 Taxonomy (table as on slide 34) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

91 Inner versus Outer Loop (diagram: a single-objective method, an MO inner loop, and an MO outer loop) Inner loop Adapting operators of a single-objective method (e.g., value iteration) Series of multi-objective operations (e.g., Bellman backups) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

92 Inner versus Outer Loop (diagram: a single-objective method, an MO inner loop, and an MO outer loop) Inner loop Adapting operators of a single-objective method (e.g., value iteration) Series of multi-objective operations (e.g., Bellman backups) Outer loop Single-objective method as a subroutine Series of single-objective problems Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

93 Inner Loop: Convex Hull Value Iteration Barrett & Narayanan (2008) Idea: do the backup for all w in parallel New backup operators must handle sets of value vectors. At each backup: generate all value vectors for each s, a-pair prune away those that are not optimal for any w Only need deterministic stationary policies Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

94 Inner Loop: Convex Hull Value Iteration Initial set of value vectors, e.g., V_0(s) = {(0, 0)} All possible value vectors: Q_{k+1}(s, a) ← ⊕_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ] where u + V = {u + v : v ∈ V}, and U ⊕ V = {u + v : u ∈ U ∧ v ∈ V} Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

95 Inner Loop: Convex Hull Value Iteration Initial set of value vectors, e.g., V_0(s) = {(0, 0)} All possible value vectors: Q_{k+1}(s, a) ← ⊕_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ] where u + V = {u + v : v ∈ V}, and U ⊕ V = {u + v : u ∈ U ∧ v ∈ V} Prune value vectors: V_{k+1}(s) ← CPrune( ∪_a Q_{k+1}(s, a) ) CPrune uses linear programs (e.g., Roijers et al. (2015)) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
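A rough sketch of one CHVI sweep under the assumptions above: the set-valued backup is a cross-sum over successor states, and CPrune is implemented here with one linear program per candidate vector via scipy.optimize.linprog. The function names are illustrative, this is not the original CHVI code, and this simple LP formulation may also keep weakly optimal vectors.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def c_prune(vectors, eps=1e-9):
    """Keep only vectors that are optimal for some weight w on the simplex."""
    kept = []
    for i, u in enumerate(vectors):
        others = [v for j, v in enumerate(vectors) if j != i]
        if not others:
            kept.append(u)
            continue
        n = len(u)
        # Variables [w_1..w_n, d]; maximize d s.t. w.(u - v) >= d for all v.
        c = np.zeros(n + 1); c[-1] = -1.0
        A_ub = np.array([np.append(-(np.asarray(u) - np.asarray(v)), 1.0) for v in others])
        A_eq = np.array([np.append(np.ones(n), 0.0)])       # weights sum to 1
        res = linprog(c, A_ub=A_ub, b_ub=np.zeros(len(others)),
                      A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, 1)] * n + [(None, None)])
        if res.success and -res.fun >= -eps:                 # best achievable d
            kept.append(u)
    return kept

def chvi_backup(V, S, A, T, R, gamma):
    """One CHVI sweep: set-valued Bellman backup followed by CPrune."""
    V_new = {}
    for s in S:
        candidates = []
        for a in A:
            # Cross-sum over successor states of T(s,a,s') [R + gamma * V_k(s')].
            per_succ = [[p * (np.asarray(R[s][a][s2]) + gamma * np.asarray(v))
                         for v in V[s2]] for p, s2 in T[s][a]]
            candidates += [sum(combo) for combo in itertools.product(*per_succ)]
        V_new[s] = c_prune(candidates)
    return V_new
```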

96 CHVI Example Extremely simple MOMDP: 1 state: s; 2 actions: a_1 and a_2 Deterministic transitions Deterministic rewards: R(s, a_1, s) = (2, 0), R(s, a_2, s) = (0, 2) γ = 0.5 V_0(s) = {(0, 0)} (figure: weight space) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

97 CHVI Example Deterministic rewards: R(s, a_1, s) = (2, 0), R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 1: V_0(s) = {(0, 0)} Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

98 CHVI Example Deterministic rewards: R(s, a_1, s) = (2, 0), R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 1: V_0(s) = {(0, 0)} Q_1(s, a_1) = {(2, 0)} Q_1(s, a_2) = {(0, 2)} Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

99 CHVI Example Deterministic rewards: R(s, a_1, s) = (2, 0), R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 1: V_0(s) = {(0, 0)} Q_1(s, a_1) = {(2, 0)} Q_1(s, a_2) = {(0, 2)} V_1(s) = CPrune( ∪_a Q_1(s, a) ) = {(2, 0), (0, 2)} (figure: weight space) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

100 CHVI Example Deterministic rewards: R(s, a_1, s) = (2, 0), R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 2: V_1(s) = {(2, 0), (0, 2)} Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

101 CHVI Example Deterministic rewards: R(s, a_1, s) = (2, 0), R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 2: V_1(s) = {(2, 0), (0, 2)} Q_2(s, a_1) = {(3, 0), (2, 1)} Q_2(s, a_2) = {(1, 2), (0, 3)} Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

102 CHVI Example Deterministic rewards: R(s, a_1, s) = (2, 0), R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 2: V_1(s) = {(2, 0), (0, 2)} Q_2(s, a_1) = {(3, 0), (2, 1)} Q_2(s, a_2) = {(1, 2), (0, 3)} V_2(s) = CPrune({(3, 0), (2, 1), (1, 2), (0, 3)}) (figure: weight space) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

103 CHVI Example Deterministic rewards: R(s, a_1, s) = (2, 0), R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 2: V_1(s) = {(2, 0), (0, 2)} Q_2(s, a_1) = {(3, 0), (2, 1)} Q_2(s, a_2) = {(1, 2), (0, 3)} V_2(s) = {(3, 0), (0, 3)} (figure: weight space) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

104 CHVI Example Deterministic rewards: R(s, a_1, s) = (2, 0), R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 3: V_2(s) = {(3, 0), (0, 3)} Q_3(s, a_1) = {(3.5, 0), (2, 1.5)} Q_3(s, a_2) = {(1.5, 2), (0, 3.5)} V_3(s) = CPrune({(3.5, 0), (2, 1.5), (1.5, 2), (0, 3.5)}) = {(3.5, 0), (0, 3.5)} (figure: weight space) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

105 Convex Hull Value Iteration CPrune retains at least one optimal vector for each w Therefore, the V_w that would have been computed by scalarized VI is kept CHVI does not retain excess value vectors Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

106 Convex Hull Value Iteration CPrune retains at least one optimal vector for each w Therefore, the V_w that would have been computed by scalarized VI is kept CHVI does not retain excess value vectors CHVI generates a lot of excess value vectors Removal with linear programs (CPrune) is expensive Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

107 Outer Loop (diagram: a single-objective method, an MO inner loop, and an MO outer loop) Repeatedly calls a single-objective solver Generic multi-objective method: multi-objective coordination graphs multi-objective (multi-agent) MDPs multi-objective partially observable MDPs Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

108 Outer Loop: Optimistic Linear Support Optimistic linear support (OLS) adapts and improves linear support for POMDPs (Cheng (1988)) Solves scalarized instances for specific w Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

109 Outer Loop: Optimistic Linear Support Optimistic linear support (OLS) adapts and improves linear support for POMDPs (Cheng (1988)) Solves scalarized instances for specific w Terminates after checking only a finite number of weights Returns exact CCS Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

110 Linear Support (figure: scalarized value V_w as a function of the weight w_1) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

111 Linear Support (figure: scalarized value V_w as a function of the weight w_1) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

112 Linear Support (figure: scalarized value V_w as a function of the weight w_1) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

113 Linear Support (figure: scalarized value V_w as a function of the weight w_1) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

114 Optimistic Linear Support (figure: weight space with value vectors (1,8), (7,2), (5,6), a corner weight w_c, and maximal possible improvement Δ) Priority queue, Q, for corner weights Maximal possible improvement as priority Stop when Δ < ε Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

115 Optimistic Linear Support Solving the scalarized instances exactly is not always possible ε-approximate solver Produces an ε-CCS Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

116 Reusing Value Functions Observation: when weights w are close, so are the optimal value functions V Hot start with V(s) from a close w Optimistic Linear Support with Alpha Reuse (OLSAR) (plot: runtime (ms) against ε for RS, RAR, OLS+, and OLSAR) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

117 Comparing Inner and Outer Loop OLS (outer loop) advantages Any (cooperative) multi-objective decision problem Any single-objective / scalarized subroutine Inherits quality guarantees Faster for small and medium numbers of objectives Inner loop faster for large numbers of objectives Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

118 Taxonomy (table as on slide 34) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

119 Inner Loop: Pareto-Q Similar to CHVI Different pruning operator Pairwise comparisons: V(s) ≻_P V'(s) Comparisons cheaper, but many more vectors Converges to the correct Pareto coverage set (White (1982)) Executing a policy is no longer trivial (Van Moffaert & Nowé (2014)) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

120 Inner Loop: Pareto-Q Compute all possible vectors: Q_{k+1}(s, a) ← ⊕_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ] where u + V = {u + v : v ∈ V}, U ⊕ V = {u + v : u ∈ U ∧ v ∈ V} Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

121 Inner Loop: Pareto-Q Compute all possible vectors: Q_{k+1}(s, a) ← ⊕_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ] where u + V = {u + v : v ∈ V}, U ⊕ V = {u + v : u ∈ U ∧ v ∈ V} Take the union across a Prune Pareto-dominated vectors: V_{k+1}(s) ← PPrune( ∪_a Q_{k+1}(s, a) ) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
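The same set-valued backup with Pareto pruning instead of linear-scalarization pruning; a self-contained sketch with illustrative names:

```python
import itertools
import numpy as np

def p_prune(vectors):
    """Remove Pareto-dominated vectors (and keep one copy of duplicates)."""
    kept = []
    for u in vectors:
        u = np.asarray(u, dtype=float)
        if any(np.all(np.asarray(v) >= u) and np.any(np.asarray(v) > u) for v in vectors):
            continue                                   # u is Pareto-dominated
        if not any(np.array_equal(u, v) for v in kept):
            kept.append(u)
    return kept

def pareto_q_backup(V, S, A, T, R, gamma):
    """One Pareto-Q sweep: same set-based backup as CHVI, but pruning
    with Pareto dominance instead of linear-scalarization optimality."""
    V_new = {}
    for s in S:
        candidates = []
        for a in A:
            per_succ = [[p * (np.asarray(R[s][a][s2]) + gamma * np.asarray(v))
                         for v in V[s2]] for p, s2 in T[s][a]]
            candidates += [sum(combo) for combo in itertools.product(*per_succ)]
        V_new[s] = p_prune(candidates)
    return V_new
```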

122 Pareto-Q Example Extremely simple MOMDP: 1 state: s; 2 actions: a_1 and a_2 Deterministic rewards: R(s, a_1, s) = (2, 0), R(s, a_2, s) = (0, 2) γ = 0.5 V_0(s) = {(0, 0)} (figure: objective space) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

123 Pareto-Q Example Deterministic rewards: R(s, a_1, s) = (2, 0), R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 1: V_0(s) = {(0, 0)} Q_1(s, a_1) = {(2, 0)} Q_1(s, a_2) = {(0, 2)} V_1(s) = PPrune( ∪_a Q_1(s, a) ) = {(2, 0), (0, 2)} (figure: objective space) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

124 Pareto-Q Example Deterministic rewards: R(s, a_1, s) = (2, 0), R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 2: V_1(s) = {(2, 0), (0, 2)} Q_2(s, a_1) = {(3, 0), (2, 1)} Q_2(s, a_2) = {(1, 2), (0, 3)} V_2(s) = PPrune({(3, 0), (2, 1), (1, 2), (0, 3)}) (figure: objective space) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

125 Pareto-Q Example Deterministic rewards: R(s, a_1, s) = (2, 0), R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 3: V_2(s) = {(3, 0), (2, 1), (1, 2), (0, 3)} Q_3(s, a_1) = {(3.5, 0), (3, 0.5), (2.5, 1), (2, 1.5)} Q_3(s, a_2) = {(1.5, 2), (1, 2.5), (0.5, 3), (0, 3.5)} V_3(s) = PPrune({(3.5, 0), (3, 0.5), (2.5, 1), (2, 1.5), (1.5, 2), (1, 2.5), (0.5, 3), (0, 3.5)}) (figure: objective space) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

126 Inner Loop: Pareto-Q PCS size can explode Policies are no longer stationary Cannot read the policy from the Q-table Except for the first action Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

127 Inner Loop: Pareto-Q PCS size can explode Policies are no longer stationary Cannot read the policy from the Q-table Except for the first action Track a policy during execution (Van Moffaert & Nowé (2014)) For transitions s, a → s': From Q_{t=0}(s, a) subtract R(s, a) Correct for the discount factor → V_{t=1}(s') Find V_{t=1}(s') in the Q-tables for s' For stochastic transitions, see Kristof Van Moffaert's PhD thesis Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
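A sketch of that tracking procedure for deterministic transitions (illustrative names; the stochastic case needs the more involved bookkeeping referenced above):

```python
import numpy as np

def next_target(target, reward, gamma):
    """After executing (s, a) with chosen target vector Q_t(s, a),
    subtract the observed reward and undo the discount to obtain the
    value vector to look for in the next state's Q-sets."""
    return (np.asarray(target, dtype=float) - np.asarray(reward, dtype=float)) / gamma

def find_next_choice(Q_sets_next, target, tol=1e-6):
    """Q_sets_next: dict mapping action -> list of value vectors at s'.
    Return an (action, vector) pair matching the propagated target."""
    for a, vecs in Q_sets_next.items():
        for v in vecs:
            if np.allclose(v, target, atol=tol):
                return a, v
    raise ValueError("target vector not found; deterministic-transition sketch only")
```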

128 Outer Loop? (diagram: a single-objective method, an MO inner loop, and an MO outer loop) Outer loop very difficult: V_w^π = f( E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ], w ) ≠ E[ Σ_{k=0}^∞ γ^k f(r_{t+k+1}, w) ] Maximization does not do the trick! Heuristic with non-linear f (Van Moffaert, Drugan, Nowé (2013)) Not guaranteed to find the optimal policy, or to converge Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

129 Taxonomy (table as on slide 34) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

130 Part 2: Methods and Applications Convex Coverage Set Planning Methods Inner Loop: Convex Hull Value Iteration Outer Loop: Optimistic Linear Support Pareto Coverage Set Planning Methods Inner loop (non-stationary): Pareto-Q Outer loop issues Beyond the Taxonomy Applications Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

131 Non-linear f and stationary policies Scalarized returns are not additive. π : S → A CON-MODP (Wiering & De Jong (2007)) Does multi-objective Bellman backups, like Pareto-Q Additional CONsistency operator to force stationarity Guaranteed to converge to the correct PCS Computationally expensive Heuristic method: Pareto Local Policy Search (Kooijman et al. (2015)) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

132 Learning Scenarios (diagram: learning phase: dataset → Learning Algorithm → coverage set; selection phase: Scalarization with weights → selected policy; execution phase) Model-based learning (Wiering, Withagen & Drugan (2014)) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

133 Learning Methods Q-Learning schema: Q_new(s, a) ← (1 - α) Q_old(s, a) + α ( R(s, a, s') + γ V_old(s') ) CCS: Parallel Q-learning (Hiraoka, Yoshida & Mashima (2009)) Det. Non-stat. PCS: Pareto-Q Learning (Van Moffaert & Nowé (2014)) Det. Stat. PCS: schema not applicable Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

134 Learning Methods Q-Learning schema: Q_new(s, a) ← (1 - α) Q_old(s, a) + α ( R(s, a, s') + γ V_old(s') ) CCS: Parallel Q-learning (Hiraoka, Yoshida & Mashima (2009)) Det. Non-stat. PCS: Pareto-Q Learning (Van Moffaert & Nowé (2014)) Det. Stat. PCS: schema not applicable Model-based learning for Det. Stat. PCS (Wiering, Withagen & Drugan (2014)) Various evolutionary approaches (e.g., Handa (2009)) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
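For the known-weights (single-policy) case, the schema above can be instantiated with vector-valued Q estimates and a linearly scalarized greedy bootstrap; a hedged sketch, not any of the cited algorithms:

```python
import numpy as np

def q_update(Q, s, a, r_vec, s2, w, alpha, gamma, actions):
    """One tabular update of the schema above with vector-valued rewards.
    Q: dict mapping (state, action) -> value vector, assumed initialized
    for all pairs. The bootstrap action is greedy w.r.t. w . Q(s', .)."""
    a2 = max(actions, key=lambda b: float(np.dot(w, Q[(s2, b)])))
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (np.asarray(r_vec) + gamma * Q[(s2, a2)])
    return Q
```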

135 Learning Scenarios (diagram: Environment → states, actions, rewards → Learning Algorithm → partial coverage set → Policy Selector, which receives weights and returns the selected policy) Dynamic preferences (Natarajan & Tadepalli (2005)) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

136 Other Decision Problems Taxonomy applies to cooperative decision problems Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

137 Other Decision Problems Taxonomy applies to cooperative decision problems Multi-objective coordination graphs / constraint optimization problems (Rollón (2008); Roijers, Whiteson & Oliehoek (2015a); Wilson, Abdul, Marinescu (2015) ) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

138 Other Decision Problems Taxonomy applies to cooperative decision problems Multi-objective coordination graphs / constraint optimization problems (Rollón (2008); Roijers, Whiteson & Oliehoek (2015a); Wilson, Abdul, Marinescu (2015) ) Multi-objective POMDPs (Soh & Demeris (2011); Roijers, Whiteson & Oliehoek (2015b); Wray & Zilberstein (2015)) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

139 Other Decision Problems Taxonomy applies to cooperative decision problems Multi-objective coordination graphs / constraint optimization problems (Rollón (2008); Roijers, Whiteson & Oliehoek (2015a); Wilson, Abdul, Marinescu (2015) ) Multi-objective POMDPs (Soh & Demeris (2011); Roijers, Whiteson & Oliehoek (2015b); Wray & Zilberstein (2015)) Multi-objective bandit problems (Drugan & Nowé (2013)) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

140 Part 2: Methods and Applications Convex Coverage Set Planning Methods Inner Loop: Convex Hull Value Iteration Outer Loop: Optimistic Linear Support Pareto Coverage Set Planning Methods Inner loop (non-stationary): Pareto-Q Outer loop issues Beyond the Taxonomy Applications Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

141 Medical Applications Treatment planning (Lizotte (2010, 2012)) Maximizing effectiveness of the treatment Minimizing the severity of the side-effects Finite-horizon MOMDPs Deterministic policies Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

142 Medical Applications Anthrax response (Soh & Demiris (2011)) Minimizing loss of life Minimizing number of false alarms Minimizing cost of investigation Partial observability (MOPOMDP) Finite-state controllers Evolutionary method Pareto coverage set Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

143 Medical Applications Control system for wheelchairs (Soh & Demiris (2011)) Maximizing safety Maximizing speed Minimizing power consumption Partial observability (MOPOMDP) Finite-state controllers Evolutionary method Pareto coverage set Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

144 Environmental Applications Environmental control of a building (Kwak et al. (2012)) Minimizing energy consumption Maximizing comfort Reduce energy consumption by 30% w.r.t. old system While maintaining comfort Groups of people with different preferences Smart grids Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

145 Broader Application Probabilistic Planning is Multi-objective Bryce et al. (2007) The expected return is not enough Cost of a plan Probability of success of a plan Non-goal terminal states Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

146 Closing Consider multiple objectives most problems have them a priori scalarization can be bad Derive your solution set Pareto front often not necessary Promising applications Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1


More information

CS 598 Statistical Reinforcement Learning. Nan Jiang

CS 598 Statistical Reinforcement Learning. Nan Jiang CS 598 Statistical Reinforcement Learning Nan Jiang Overview What s this course about? A grad-level seminar course on theory of RL 3 What s this course about? A grad-level seminar course on theory of RL

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Formal models of interaction Daniel Hennes 27.11.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Taxonomy of domains Models of

More information

16.4 Multiattribute Utility Functions

16.4 Multiattribute Utility Functions 285 Normalized utilities The scale of utilities reaches from the best possible prize u to the worst possible catastrophe u Normalized utilities use a scale with u = 0 and u = 1 Utilities of intermediate

More information

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation CS234: RL Emma Brunskill Winter 2018 Material builds on structure from David SIlver s Lecture 4: Model-Free

More information

CS599 Lecture 1 Introduction To RL

CS599 Lecture 1 Introduction To RL CS599 Lecture 1 Introduction To RL Reinforcement Learning Introduction Learning from rewards Policies Value Functions Rewards Models of the Environment Exploitation vs. Exploration Dynamic Programming

More information

Using first-order logic, formalize the following knowledge:

Using first-order logic, formalize the following knowledge: Probabilistic Artificial Intelligence Final Exam Feb 2, 2016 Time limit: 120 minutes Number of pages: 19 Total points: 100 You can use the back of the pages if you run out of space. Collaboration on the

More information

Decision Theory: Markov Decision Processes

Decision Theory: Markov Decision Processes Decision Theory: Markov Decision Processes CPSC 322 Lecture 33 March 31, 2006 Textbook 12.5 Decision Theory: Markov Decision Processes CPSC 322 Lecture 33, Slide 1 Lecture Overview Recap Rewards and Policies

More information

Reinforcement Learning

Reinforcement Learning 1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision

More information

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018 Machine learning for image classification Lecture 14: Reinforcement learning May 9, 2018 Page 3 Outline Motivation Introduction to reinforcement learning (RL) Value function based methods (Q-learning)

More information

Notes on Reinforcement Learning

Notes on Reinforcement Learning 1 Introduction Notes on Reinforcement Learning Paulo Eduardo Rauber 2014 Reinforcement learning is the study of agents that act in an environment with the goal of maximizing cumulative reward signals.

More information

Prioritized Sweeping Converges to the Optimal Value Function

Prioritized Sweeping Converges to the Optimal Value Function Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Dipendra Misra Cornell University dkm@cs.cornell.edu https://dipendramisra.wordpress.com/ Task Grasp the green cup. Output: Sequence of controller actions Setup from Lenz et. al.

More information

CS 4100 // artificial intelligence. Recap/midterm review!

CS 4100 // artificial intelligence. Recap/midterm review! CS 4100 // artificial intelligence instructor: byron wallace Recap/midterm review! Attribution: many of these slides are modified versions of those distributed with the UC Berkeley CS188 materials Thanks

More information

Optimizing Memory-Bounded Controllers for Decentralized POMDPs

Optimizing Memory-Bounded Controllers for Decentralized POMDPs Optimizing Memory-Bounded Controllers for Decentralized POMDPs Christopher Amato, Daniel S. Bernstein and Shlomo Zilberstein Department of Computer Science University of Massachusetts Amherst, MA 01003

More information

Reinforcement learning an introduction

Reinforcement learning an introduction Reinforcement learning an introduction Prof. Dr. Ann Nowé Computational Modeling Group AIlab ai.vub.ac.be November 2013 Reinforcement Learning What is it? Learning from interaction Learning about, from,

More information

A Decentralized Approach to Multi-agent Planning in the Presence of Constraints and Uncertainty

A Decentralized Approach to Multi-agent Planning in the Presence of Constraints and Uncertainty 2011 IEEE International Conference on Robotics and Automation Shanghai International Conference Center May 9-13, 2011, Shanghai, China A Decentralized Approach to Multi-agent Planning in the Presence of

More information

Hidden Markov Models (HMM) and Support Vector Machine (SVM)

Hidden Markov Models (HMM) and Support Vector Machine (SVM) Hidden Markov Models (HMM) and Support Vector Machine (SVM) Professor Joongheon Kim School of Computer Science and Engineering, Chung-Ang University, Seoul, Republic of Korea 1 Hidden Markov Models (HMM)

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Reinforcement learning Daniel Hennes 4.12.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Reinforcement learning Model based and

More information

Point-Based Value Iteration for Constrained POMDPs

Point-Based Value Iteration for Constrained POMDPs Point-Based Value Iteration for Constrained POMDPs Dongho Kim Jaesong Lee Kee-Eung Kim Department of Computer Science Pascal Poupart School of Computer Science IJCAI-2011 2011. 7. 22. Motivation goals

More information

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning CSE 473: Artificial Intelligence Reinforcement Learning Dan Weld Today s Outline Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Many slides

More information

CS788 Dialogue Management Systems Lecture #2: Markov Decision Processes

CS788 Dialogue Management Systems Lecture #2: Markov Decision Processes CS788 Dialogue Management Systems Lecture #2: Markov Decision Processes Kee-Eung Kim KAIST EECS Department Computer Science Division Markov Decision Processes (MDPs) A popular model for sequential decision

More information

6 Reinforcement Learning

6 Reinforcement Learning 6 Reinforcement Learning As discussed above, a basic form of supervised learning is function approximation, relating input vectors to output vectors, or, more generally, finding density functions p(y,

More information

Optimal Control of Partiality Observable Markov. Processes over a Finite Horizon

Optimal Control of Partiality Observable Markov. Processes over a Finite Horizon Optimal Control of Partiality Observable Markov Processes over a Finite Horizon Report by Jalal Arabneydi 04/11/2012 Taken from Control of Partiality Observable Markov Processes over a finite Horizon by

More information

Temporal Difference Learning & Policy Iteration

Temporal Difference Learning & Policy Iteration Temporal Difference Learning & Policy Iteration Advanced Topics in Reinforcement Learning Seminar WS 15/16 ±0 ±0 +1 by Tobias Joppen 03.11.2015 Fachbereich Informatik Knowledge Engineering Group Prof.

More information

Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm

Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 2778 mgl@cs.duke.edu

More information

Chapter 16 Planning Based on Markov Decision Processes

Chapter 16 Planning Based on Markov Decision Processes Lecture slides for Automated Planning: Theory and Practice Chapter 16 Planning Based on Markov Decision Processes Dana S. Nau University of Maryland 12:48 PM February 29, 2012 1 Motivation c a b Until

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Based on [Kaelbling et al., 1996, Bertsekas, 2000] Bert Kappen Reinforcement learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error

More information

Markov Decision Processes (and a small amount of reinforcement learning)

Markov Decision Processes (and a small amount of reinforcement learning) Markov Decision Processes (and a small amount of reinforcement learning) Slides adapted from: Brian Williams, MIT Manuela Veloso, Andrew Moore, Reid Simmons, & Tom Mitchell, CMU Nicholas Roy 16.4/13 Session

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Markov decision process & Dynamic programming Evaluative feedback, value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value

More information

Reinforcement Learning and Control

Reinforcement Learning and Control CS9 Lecture notes Andrew Ng Part XIII Reinforcement Learning and Control We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make

More information

Computing Possibly Optimal Solutions for Multi-Objective Constraint Optimisation with Tradeoffs

Computing Possibly Optimal Solutions for Multi-Objective Constraint Optimisation with Tradeoffs Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 215) Computing Possibly Optimal Solutions for Multi-Objective Constraint Optimisation with Tradeoffs Nic

More information

An Adaptive Clustering Method for Model-free Reinforcement Learning

An Adaptive Clustering Method for Model-free Reinforcement Learning An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at

More information

Sequential decision making under uncertainty. Department of Computer Science, Czech Technical University in Prague

Sequential decision making under uncertainty. Department of Computer Science, Czech Technical University in Prague Sequential decision making under uncertainty Jiří Kléma Department of Computer Science, Czech Technical University in Prague /doku.php/courses/a4b33zui/start pagenda Previous lecture: individual rational

More information

Lecture 1: March 7, 2018

Lecture 1: March 7, 2018 Reinforcement Learning Spring Semester, 2017/8 Lecture 1: March 7, 2018 Lecturer: Yishay Mansour Scribe: ym DISCLAIMER: Based on Learning and Planning in Dynamical Systems by Shie Mannor c, all rights

More information

Planning Under Uncertainty II

Planning Under Uncertainty II Planning Under Uncertainty II Intelligent Robotics 2014/15 Bruno Lacerda Announcement No class next Monday - 17/11/2014 2 Previous Lecture Approach to cope with uncertainty on outcome of actions Markov

More information

RL 14: POMDPs continued

RL 14: POMDPs continued RL 14: POMDPs continued Michael Herrmann University of Edinburgh, School of Informatics 06/03/2015 POMDPs: Points to remember Belief states are probability distributions over states Even if computationally

More information

Multiagent Value Iteration in Markov Games

Multiagent Value Iteration in Markov Games Multiagent Value Iteration in Markov Games Amy Greenwald Brown University with Michael Littman and Martin Zinkevich Stony Brook Game Theory Festival July 21, 2005 Agenda Theorem Value iteration converges

More information

Lecture 3: Markov Decision Processes

Lecture 3: Markov Decision Processes Lecture 3: Markov Decision Processes Joseph Modayil 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs Markov Processes Introduction Introduction to MDPs Markov

More information

Lecture 25: Learning 4. Victor R. Lesser. CMPSCI 683 Fall 2010

Lecture 25: Learning 4. Victor R. Lesser. CMPSCI 683 Fall 2010 Lecture 25: Learning 4 Victor R. Lesser CMPSCI 683 Fall 2010 Final Exam Information Final EXAM on Th 12/16 at 4:00pm in Lederle Grad Res Ctr Rm A301 2 Hours but obviously you can leave early! Open Book

More information

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels? Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity

More information

Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan

Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan Some slides borrowed from Peter Bodik and David Silver Course progress Learning

More information

Lecture 3: The Reinforcement Learning Problem

Lecture 3: The Reinforcement Learning Problem Lecture 3: The Reinforcement Learning Problem Objectives of this lecture: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information

Machine Learning I Reinforcement Learning

Machine Learning I Reinforcement Learning Machine Learning I Reinforcement Learning Thomas Rückstieß Technische Universität München December 17/18, 2009 Literature Book: Reinforcement Learning: An Introduction Sutton & Barto (free online version:

More information

Reinforcement Learning

Reinforcement Learning CS7/CS7 Fall 005 Supervised Learning: Training examples: (x,y) Direct feedback y for each input x Sequence of decisions with eventual feedback No teacher that critiques individual actions Learn to act

More information

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it

More information

Real Time Value Iteration and the State-Action Value Function

Real Time Value Iteration and the State-Action Value Function MS&E338 Reinforcement Learning Lecture 3-4/9/18 Real Time Value Iteration and the State-Action Value Function Lecturer: Ben Van Roy Scribe: Apoorva Sharma and Tong Mu 1 Review Last time we left off discussing

More information

Sequential Decision Problems

Sequential Decision Problems Sequential Decision Problems Michael A. Goodrich November 10, 2006 If I make changes to these notes after they are posted and if these changes are important (beyond cosmetic), the changes will highlighted

More information

Efficient Maximization in Solving POMDPs

Efficient Maximization in Solving POMDPs Efficient Maximization in Solving POMDPs Zhengzhu Feng Computer Science Department University of Massachusetts Amherst, MA 01003 fengzz@cs.umass.edu Shlomo Zilberstein Computer Science Department University

More information

Reinforcement Learning. Yishay Mansour Tel-Aviv University

Reinforcement Learning. Yishay Mansour Tel-Aviv University Reinforcement Learning Yishay Mansour Tel-Aviv University 1 Reinforcement Learning: Course Information Classes: Wednesday Lecture 10-13 Yishay Mansour Recitations:14-15/15-16 Eliya Nachmani Adam Polyak

More information

The exam is closed book, closed calculator, and closed notes except your one-page crib sheet.

The exam is closed book, closed calculator, and closed notes except your one-page crib sheet. CS 188 Spring 2017 Introduction to Artificial Intelligence Midterm V2 You have approximately 80 minutes. The exam is closed book, closed calculator, and closed notes except your one-page crib sheet. Mark

More information

On Prediction and Planning in Partially Observable Markov Decision Processes with Large Observation Sets

On Prediction and Planning in Partially Observable Markov Decision Processes with Large Observation Sets On Prediction and Planning in Partially Observable Markov Decision Processes with Large Observation Sets Pablo Samuel Castro pcastr@cs.mcgill.ca McGill University Joint work with: Doina Precup and Prakash

More information

CS 188 Introduction to Fall 2007 Artificial Intelligence Midterm

CS 188 Introduction to Fall 2007 Artificial Intelligence Midterm NAME: SID#: Login: Sec: 1 CS 188 Introduction to Fall 2007 Artificial Intelligence Midterm You have 80 minutes. The exam is closed book, closed notes except a one-page crib sheet, basic calculators only.

More information

Learning to Coordinate with Coordination Graphs in Repeated Single-Stage Multi-Agent Decision Problems

Learning to Coordinate with Coordination Graphs in Repeated Single-Stage Multi-Agent Decision Problems Learning to Coordinate with Coordination Graphs in Repeated Single-Stage Multi-Agent Decision Problems Eugenio Bargiacchi Timothy Verstraeten Diederik M. Roijers 2 Ann Nowé Hado van Hasselt 3 Abstract

More information

16.410/413 Principles of Autonomy and Decision Making

16.410/413 Principles of Autonomy and Decision Making 16.410/413 Principles of Autonomy and Decision Making Lecture 23: Markov Decision Processes Policy Iteration Emilio Frazzoli Aeronautics and Astronautics Massachusetts Institute of Technology December

More information

Chapter 3: The Reinforcement Learning Problem

Chapter 3: The Reinforcement Learning Problem Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information

Towards Faster Planning with Continuous Resources in Stochastic Domains

Towards Faster Planning with Continuous Resources in Stochastic Domains Towards Faster Planning with Continuous Resources in Stochastic Domains Janusz Marecki and Milind Tambe Computer Science Department University of Southern California 941 W 37th Place, Los Angeles, CA 989

More information

(Deep) Reinforcement Learning

(Deep) Reinforcement Learning Martin Matyášek Artificial Intelligence Center Czech Technical University in Prague October 27, 2016 Martin Matyášek VPD, 2016 1 / 17 Reinforcement Learning in a picture R. S. Sutton and A. G. Barto 2015

More information

Reinforcement Learning and Deep Reinforcement Learning

Reinforcement Learning and Deep Reinforcement Learning Reinforcement Learning and Deep Reinforcement Learning Ashis Kumer Biswas, Ph.D. ashis.biswas@ucdenver.edu Deep Learning November 5, 2018 1 / 64 Outlines 1 Principles of Reinforcement Learning 2 The Q

More information

Course basics. CSE 190: Reinforcement Learning: An Introduction. Last Time. Course goals. The website for the class is linked off my homepage.

Course basics. CSE 190: Reinforcement Learning: An Introduction. Last Time. Course goals. The website for the class is linked off my homepage. Course basics CSE 190: Reinforcement Learning: An Introduction The website for the class is linked off my homepage. Grades will be based on programming assignments, homeworks, and class participation.

More information

Multi Objective Optimization

Multi Objective Optimization Multi Objective Optimization Handout November 4, 2011 (A good reference for this material is the book multi-objective optimization by K. Deb) 1 Multiple Objective Optimization So far we have dealt with

More information

AM 121: Intro to Optimization Models and Methods: Fall 2018

AM 121: Intro to Optimization Models and Methods: Fall 2018 AM 11: Intro to Optimization Models and Methods: Fall 018 Lecture 18: Markov Decision Processes Yiling Chen Lesson Plan Markov decision processes Policies and value functions Solving: average reward, discounted

More information

Homework 2: MDPs and Search

Homework 2: MDPs and Search Graduate Artificial Intelligence 15-780 Homework 2: MDPs and Search Out on February 15 Due on February 29 Problem 1: MDPs [Felipe, 20pts] Figure 1: MDP for Problem 1. States are represented by circles

More information