Multi-Objective Decision Making

1 Multi-Objective Decision Making Shimon Whiteson & Diederik M. Roijers Department of Computer Science University of Oxford July 25, 2016 Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

2 Schedule 14:30-15:10: Motivation & Concepts (Shimon) 15:10-15:20: Short Break 15:20-16:00: Motivation & Concepts cont'd (Shimon) 16:00-16:30: Coffee Break 16:30-17:10: Methods & Applications (Diederik) 17:10-17:20: Short Break 17:20-18:00: Methods & Applications cont'd (Diederik) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

3 Note Get the latest version of the slides at: This tutorial is based on our survey article: Diederik Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A Survey of Multi-Objective Sequential Decision-Making. Journal of Artificial Intelligence Research, 48:67–113, 2013, and Diederik's dissertation: Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

4 Part 1: Motivation & Concepts Multi-Objective Motivation MDPs & MOMDPs Problem Taxonomy Solution Concepts Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

5 Medical Treatment Chance of being cured, having side effects, or dying Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

6 Traffic Coordination Latency, throughput, fairness, environmental impact, etc. Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

7 Mining Commodities Gold collected, silver collected (figure: village and mine) [Roijers et al. 2013, 2014] Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

8 Grid World Getting the treasure, minimising fuel costs Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

9 Do We Need Multi-Objective Models? Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

10 Do We Need Multi-Objective Models? Sutton's Reward Hypothesis: All of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward). Source: Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

11 Do We Need Multi-Objective Models? Sutton's Reward Hypothesis: All of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward). Source: V : Π → R, V^π = E_π[ Σ_t r_t ], π* = arg max_π V^π Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

12 Why Multi-Objective Decision Making? The weak argument: real-world problems are multi-objective! V : Π → R^n Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

13 Why Multi-Objective Decision Making? The weak argument: real-world problems are multi-objective! V : Π → R^n Objection: why not just scalarize? Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

14 Why Multi-Objective Decision Making? The weak argument: real-world problems are multi-objective! V : Π → R^n Objection: why not just scalarize? Scalarization function projects multi-objective value to a scalar: V_w^π = f(V^π, w) Linear case: V_w^π = Σ_{i=1}^n w_i V_i^π = w · V^π A priori prioritization of the objectives Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

15 Why Multi-Objective Decision Making? The weak argument: real-world problems are multi-objective! V : Π → R^n Objection: why not just scalarize? Scalarization function projects multi-objective value to a scalar: V_w^π = f(V^π, w) Linear case: V_w^π = Σ_{i=1}^n w_i V_i^π = w · V^π A priori prioritization of the objectives The weak argument is necessary but not sufficient Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

16 Why Multi-Objective Decision Making? The strong argument: a priori scalarization is sometimes impossible, infeasible, or undesirable Instead produce the coverage set of undominated solutions Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

17 Why Multi-Objective Decision Making? The strong argument: a priori scalarization is sometimes impossible, infeasible, or undesirable Instead produce the coverage set of undominated solutions Unknown-weights scenario Weights known in execution phase but not in planning phase Example: mining commodities [Roijers et al. 2013] Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

18 Why Multi-Objective Decision Making? Decision-support scenario Quantifying priorities is infeasible Choosing between options is easier Example: medical treatment Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

19 Why Multi-Objective Decision Making? Decision-support scenario Quantifying priorities is infeasible Choosing between options is easier Example: medical treatment Known-weights scenario: scalarization yields intractable problem Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

20 Summary of Motivation Multi-objective methods are useful because many problems are naturally characterized by multiple objectives and cannot be easily scalarized a priori. Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

21 Summary of Motivation Multi-objective methods are useful because many problems are naturally characterized by multiple objectives and cannot be easily scalarized a priori. The burden of proof rests with the a priori scalarization, not with the multi-objective modeling. Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

22 Part 1: Motivation & Concepts Multi-Objective Motivation MDPs & MOMDPs Problem Taxonomy Solution Concepts Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

23 Markov Decision Process (MDP) A single-objective MDP is a tuple ⟨S, A, T, R, µ, γ⟩ where: S is a finite set of states A is a finite set of actions T : S × A × S → [0, 1] is a transition function R : S × A × S → R is a reward function µ : S → [0, 1] is a probability distribution over initial states γ ∈ [0, 1) is a discount factor (figure from Poole & Mackworth, Artificial Intelligence: Foundations of Computational Agents, 2010) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

24 Returns & Policies Goal: maximize expected return, which is typically additive: R_t = Σ_{k=0}^∞ γ^k r_{t+k+1} A stationary policy conditions only on the current state: π : S × A → [0, 1] A deterministic stationary policy maps states directly to actions: π : S → A Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

25 Value Functions in MDPs A state-independent value function V^π specifies the expected return when following π from the initial state: V^π = E[R_0 | π] A state value function of a policy π: V^π(s) = E[R_t | π, s_t = s] The Bellman equation restates this expectation recursively for stationary policies: V^π(s) = Σ_a π(s, a) Σ_{s'} T(s, a, s')[R(s, a, s') + γ V^π(s')] Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

26 Optimality in MDPs Theorem For any additive infinite-horizon single-objective MDP, there exists a stationary optimal policy [Howard 1960] All optimal policies share the same optimal value function: V*(s) = max_π V^π(s) = max_a Σ_{s'} T(s, a, s')[R(s, a, s') + γ V*(s')] Extract the optimal policy using local action selection: π*(s) = arg max_a Σ_{s'} T(s, a, s')[R(s, a, s') + γ V*(s')] Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

27 Multi-Objective MDP (MOMDP) Vector-valued reward and value: R : S × A × S → R^n V^π = E[ Σ_{k=0}^∞ γ^k r_{k+1} | π ] V^π(s) = E[ Σ_{k=0}^∞ γ^k r_{t+k+1} | π, s_t = s ] V^π(s) imposes only a partial ordering, e.g., V_i^π(s) > V_i^{π'}(s) but V_j^π(s) < V_j^{π'}(s) Definition of optimality no longer clear Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

28 Part 1: Motivation & Concepts Multi-Objective Motivation MDPs & MOMDPs Problem Taxonomy Solution Concepts Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

29 Axiomatic vs. Utility-Based Approach Axiomatic approach: define optimal solution set to be Pareto front Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

30 Axiomatic vs. Utility-Based Approach Axiomatic approach: define optimal solution set to be Pareto front Utility-based approach: Execution phase: select one policy maximizing scalar utility V_w^π, where w may be hidden or implicit Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

31 Axiomatic vs. Utility-Based Approach Axiomatic approach: define optimal solution set to be Pareto front Utility-based approach: Execution phase: select one policy maximizing scalar utility V_w^π, where w may be hidden or implicit Planning phase: find a set of policies containing an optimal solution for each possible w; if w unknown, size of set generally > 1 Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

32 Axiomatic vs. Utility-Based Approach Axiomatic approach: define optimal solution set to be Pareto front Utility-based approach: Execution phase: select one policy maximizing scalar utility V_w^π, where w may be hidden or implicit Planning phase: find a set of policies containing an optimal solution for each possible w; if w unknown, size of set generally > 1 Deduce optimal solution set from three factors: 1 Multi-objective scenario 2 Properties of scalarization function 3 Allowable policies Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

33 Three Factors 1 Multi-objective scenario Known weights → single policy Unknown weights or decision support → multiple policies 2 Properties of scalarization function Linear Monotonically increasing 3 Allowable policies Deterministic Stochastic Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

34 Problem Taxonomy

                                         linear scalarization                    monotonically increasing scalarization
single policy          deterministic     one deterministic stationary policy     one deterministic non-stationary policy
(known weights)        stochastic        one deterministic stationary policy     one mixture policy of two or more
                                                                                 deterministic stationary policies
multiple policies      deterministic     convex coverage set of deterministic    Pareto coverage set of deterministic
(unknown weights or                      stationary policies                     non-stationary policies
decision support)      stochastic        convex coverage set of deterministic    convex coverage set of deterministic
                                         stationary policies                     stationary policies

Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

35 Part 1: Motivation & Concepts Multi-Objective Motivation MDPs & MOMDPs Problem Taxonomy Solution Concepts Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

36 Problem Taxonomy (table as on slide 34) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

37 Linear Scalarization Functions Computes inner product of w and V^π: V_w^π = Σ_{i=1}^n w_i V_i^π = w · V^π, w ∈ R^n w_i quantifies importance of the i-th objective Simple and intuitive, e.g., when utility translates to money: revenue = #cans · ppc + #bottles · ppb Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

38 Linear Scalarization Functions Computes inner product of w and V^π: V_w^π = Σ_{i=1}^n w_i V_i^π = w · V^π, w ∈ R^n w_i quantifies importance of the i-th objective Simple and intuitive, e.g., when utility translates to money: revenue = #cans · ppc + #bottles · ppb w typically constrained to be a convex combination: ∀i w_i ≥ 0, Σ_i w_i = 1 utility = #cans · ppc/(ppc + ppb) + #bottles · ppb/(ppc + ppb) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

39 Linear Scalarization & Single Policy No special methods required: just apply f to each reward vector Inner product distributes over addition, yielding a normal MDP: V_w^π = w · V^π = w · E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ] = E[ Σ_{k=0}^∞ γ^k (w · r_{t+k+1}) ] Apply standard methods to an MDP with: R_w(s, a, s') = w · R(s, a, s'), yielding a single deterministic stationary policy Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
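To make the reduction concrete, here is a minimal sketch (illustrative names, assuming a small tabular MOMDP stored as nested Python dicts) of scalarizing the vector-valued reward with a known w so that any standard MDP solver applies:

```python
import numpy as np

def scalarize_rewards(R, w):
    """Turn a vector-valued reward table R[s][a][s'] -> vector
    into a scalar reward table R_w[s][a][s'] = w . R(s, a, s')."""
    return {
        s: {a: {s2: float(np.dot(w, r)) for s2, r in succ.items()}
            for a, succ in acts.items()}
        for s, acts in R.items()
    }

# Example: two objectives (cans, bottles), prices as weights.
R = {'s0': {'collect': {'s0': np.array([2.0, 1.0])}}}
w = np.array([0.25, 0.10])             # price per can, price per bottle
print(scalarize_rewards(R, w))         # {'s0': {'collect': {'s0': 0.6}}}
```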

40 Problem Taxonomy (table as on slide 34) Example: collecting bottles and cans Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

41 Problem Taxonomy (table as on slide 34) Example: collecting bottles and cans Note: only cell in the taxonomy that does not require multi-objective methods Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

42 Problem Taxonomy (table as on slide 34) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

43 Multiple Policies Unknown weights or decision support → multiple policies During planning w is unknown Size of solution set is generally > 1 Set should not contain policies that are suboptimal for all w Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

44 Undominated & Coverage Sets Definition The undominated set U(Π) is the subset of all possible policies Π for which there exists a w for which the scalarized value is maximal: U(Π) = {π : π ∈ Π ∧ ∃w ∀(π' ∈ Π) V_w^π ≥ V_w^{π'}} Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

45 Undominated & Coverage Sets Definition The undominated set U(Π) is the subset of all possible policies Π for which there exists a w for which the scalarized value is maximal: U(Π) = {π : π ∈ Π ∧ ∃w ∀(π' ∈ Π) V_w^π ≥ V_w^{π'}} Definition A coverage set CS(Π) is a subset of U(Π) that, for every w, contains a policy with maximal scalarized value, i.e., CS(Π) ⊆ U(Π) ∧ (∀w)(∃π)( π ∈ CS(Π) ∧ ∀(π' ∈ Π) V_w^π ≥ V_w^{π'} ) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

46 Example (table: scalarized values V_w^π of four policies π_1, ..., π_4 for the two weights w = true and w = false) One binary weight feature: only two possible weights Weights are not objectives but two possible scalarizations Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

47 Example (table: scalarized values V_w^π of four policies π_1, ..., π_4 for the two weights w = true and w = false) One binary weight feature: only two possible weights Weights are not objectives but two possible scalarizations U(Π) = {π_1, π_2, π_3} but CS(Π) = {π_1, π_2} or {π_2, π_3} Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

48 Execution Phase Single policy selected from CS(Π) and executed Unknown weights: weights revealed and maximizing policy selected: π* = arg max_{π ∈ CS(Π)} V_w^π Decision support: CS(Π) is manually inspected by the user Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
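A small illustrative sketch (hypothetical names, assuming linear scalarization and a coverage set stored as (policy, value-vector) pairs) of the unknown-weights selection step once w is revealed:

```python
import numpy as np

def select_policy(coverage_set, w):
    """coverage_set: list of (policy_id, value_vector) pairs.
    Returns the entry maximizing the linearly scalarized value w . V."""
    return max(coverage_set, key=lambda entry: float(np.dot(w, entry[1])))

ccs = [('pi_gold', np.array([3.0, 0.0])),
       ('pi_both', np.array([1.0, 1.0])),
       ('pi_silver', np.array([0.0, 3.0]))]
print(select_policy(ccs, np.array([0.8, 0.2])))   # ('pi_gold', array([3., 0.]))
```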

49 Linear Scalarization & Multiple Policies Definition The convex hull CH(Π) is the subset of Π for which there exists a w that maximizes the linearly scalarized value: CH(Π) = {π : π ∈ Π ∧ ∃w ∀(π' ∈ Π) w · V^π ≥ w · V^{π'}} Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

50 Linear Scalarization & Multiple Policies Definition The convex hull CH(Π) is the subset of Π for which there exists a w that maximizes the linearly scalarized value: CH(Π) = {π : π ∈ Π ∧ ∃w ∀(π' ∈ Π) w · V^π ≥ w · V^{π'}} Definition The convex coverage set CCS(Π) is a subset of CH(Π) that, for every w, contains a policy whose linearly scalarized value is maximal, i.e., CCS(Π) ⊆ CH(Π) ∧ (∀w)(∃π)( π ∈ CCS(Π) ∧ ∀(π' ∈ Π) w · V^π ≥ w · V^{π'} ) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

51 Visualization (figures: Objective Space and Weight Space) V_w = w_0 V_0 + w_1 V_1, with w_0 = 1 - w_1 Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

52 Problem Taxonomy (table as on slide 34) Example: mining gold and silver Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

53 Problem Taxonomy (table as on slide 34) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

54 Monotonically Increasing Scalarization Functions Mining example: V^{π_1} = (3, 0), V^{π_2} = (0, 3), V^{π_3} = (1, 1) Choosing V^{π_3} implies a nonlinear scalarization function Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

55 Monotonically Increasing Scalarization Functions Definition A scalarization function is strictly monotonically increasing if changing a policy such that its value increases in one or more objectives, without decreasing in any other objectives, also increases the scalarized value: ( ∀i V_i^{π'} ≥ V_i^π ∧ ∃i V_i^{π'} > V_i^π ) ⟹ ( ∀w V_w^{π'} > V_w^π ) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

56 Monotonically Increasing Scalarization Functions Definition A scalarization function is strictly monotonically increasing if changing a policy such that its value increases in one or more objectives, without decreasing in any other objectives, also increases the scalarized value: ( ∀i V_i^{π'} ≥ V_i^π ∧ ∃i V_i^{π'} > V_i^π ) ⟹ ( ∀w V_w^{π'} > V_w^π ) Definition A policy π Pareto-dominates another policy π' when its value is at least as high in all objectives and strictly higher in at least one objective: V^π ≻_P V^{π'} ⟺ ∀i V_i^π ≥ V_i^{π'} ∧ ∃i V_i^π > V_i^{π'} A policy is Pareto optimal if no policy Pareto-dominates it. Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
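A short sketch of the Pareto-dominance test as defined above (NumPy-based; the function name is illustrative):

```python
import numpy as np

def pareto_dominates(v, u):
    """True iff value vector v Pareto-dominates u:
    at least as good in every objective and strictly better in one."""
    v, u = np.asarray(v, dtype=float), np.asarray(u, dtype=float)
    return bool(np.all(v >= u) and np.any(v > u))

print(pareto_dominates([3, 1], [2, 1]))   # True
print(pareto_dominates([3, 0], [0, 3]))   # False: incomparable
```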

57 Nonlinear Scalarization Can Destroy Additivity Nonlinear scalarization and expectation do not commute: V_w^π = f(V^π, w) = f( E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ], w ) ≠ E[ Σ_{k=0}^∞ γ^k f(r_{t+k+1}, w) ] Bellman-based methods not applicable Local action selection no longer yields an optimal policy: π*(s) ≠ arg max_a Σ_{s'} T(s, a, s')[R(s, a, s') + γ V*(s')] Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

58 Deterministic vs. Stochastic Policies Stochastic policies are fine in most settings Sometimes inappropriate, e.g., medical treatment In MDPs, requiring deterministic policies is not restrictive Optimal value attainable with a deterministic stationary policy: π*(s) = arg max_a Σ_{s'} T(s, a, s')[R(s, a, s') + γ V*(s')] Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

59 Deterministic vs. Stochastic Policies Stochastic policies are fine in most settings Sometimes inappropriate, e.g., medical treatment In MDPs, requiring deterministic policies is not restrictive Optimal value attainable with a deterministic stationary policy: π*(s) = arg max_a Σ_{s'} T(s, a, s')[R(s, a, s') + γ V*(s')] Similar for MOMDPs with linear scalarization MOMDPs with nonlinear scalarization: Stochastic policies may be preferable if allowed Nonstationary policies may be preferable otherwise Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

60 White's Example (1982) 3 actions: R(a_1) = (3, 0), R(a_2) = (0, 3), R(a_3) = (1, 1) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

61 White's Example (1982) 3 actions: R(a_1) = (3, 0), R(a_2) = (0, 3), R(a_3) = (1, 1) 3 stationary policies, all Pareto-optimal: V^{π_1} = ( 3/(1-γ), 0 ), V^{π_2} = ( 0, 3/(1-γ) ), V^{π_3} = ( 1/(1-γ), 1/(1-γ) ) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

62 White's Example (1982) 3 actions: R(a_1) = (3, 0), R(a_2) = (0, 3), R(a_3) = (1, 1) 3 stationary policies, all Pareto-optimal: V^{π_1} = ( 3/(1-γ), 0 ), V^{π_2} = ( 0, 3/(1-γ) ), V^{π_3} = ( 1/(1-γ), 1/(1-γ) ) π_ns alternates between a_1 and a_2, starting with a_1: V^{π_ns} = ( 3/(1-γ²), 3γ/(1-γ²) ) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

63 White's Example (1982) 3 actions: R(a_1) = (3, 0), R(a_2) = (0, 3), R(a_3) = (1, 1) 3 stationary policies, all Pareto-optimal: V^{π_1} = ( 3/(1-γ), 0 ), V^{π_2} = ( 0, 3/(1-γ) ), V^{π_3} = ( 1/(1-γ), 1/(1-γ) ) π_ns alternates between a_1 and a_2, starting with a_1: V^{π_ns} = ( 3/(1-γ²), 3γ/(1-γ²) ) Thus V^{π_ns} ≻_P V^{π_3} when γ ≥ 0.5, e.g., γ = 0.5 and f(V^π) = V_1^π · V_2^π: f(V^{π_1}) = f(V^{π_2}) = 0, f(V^{π_3}) = 4, f(V^{π_ns}) = 8 (figures: value vectors in objective space for several values of γ) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
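A quick numeric check of White's example under the values above (γ = 0.5 and the nonlinear scalarization f(V) = V_1 · V_2):

```python
# Values from White's example for a given discount factor.
def values(gamma):
    v1 = (3 / (1 - gamma), 0.0)                              # always a1
    v2 = (0.0, 3 / (1 - gamma))                              # always a2
    v3 = (1 / (1 - gamma), 1 / (1 - gamma))                  # always a3
    vns = (3 / (1 - gamma**2), 3 * gamma / (1 - gamma**2))   # alternate a1, a2
    return v1, v2, v3, vns

v1, v2, v3, vns = values(0.5)
f = lambda v: v[0] * v[1]           # the nonlinear scalarization used above
print(v3, vns)                      # (2.0, 2.0) (4.0, 2.0)
print(f(v3), f(vns))                # 4.0 8.0 -> pi_ns Pareto-dominates pi_3
```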

64 Problem Taxonomy (table as on slide 34) Example: radiation vs. chemotherapy Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

65 Problem Taxonomy (table as on slide 34) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

66 Mixture Policies A mixture policy π_m selects the i-th policy from a set of N policies with probability p_i, where Σ_i p_i = 1 Values are a convex combination of the values of the constituent policies In White's example, replace π_ns by π_m: V^{π_m} = p_1 V^{π_1} + (1 - p_1) V^{π_2} = ( 3p_1/(1-γ), 3(1-p_1)/(1-γ) ) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
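A minimal sketch (illustrative names) of computing a mixture policy's value as a convex combination of its constituents, reproducing the expression above for γ = 0.5 and p_1 = 0.5:

```python
import numpy as np

def mixture_value(values, probs):
    """Value of a mixture policy: convex combination of the values
    of its constituent (deterministic stationary) policies."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    assert np.isclose(probs.sum(), 1.0) and np.all(probs >= 0)
    return probs @ values

gamma, p1 = 0.5, 0.5
v1, v2 = [3 / (1 - gamma), 0.0], [0.0, 3 / (1 - gamma)]
print(mixture_value([v1, v2], [p1, 1 - p1]))   # [3. 3.]
```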

67 Problem Taxonomy (table as on slide 34) Example: studying vs. networking Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

68 Problem Taxonomy (table as on slide 34) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

69 Pareto Sets Definition The Pareto front is the set of all policies that are not Pareto-dominated: PF(Π) = {π : π ∈ Π ∧ ¬∃(π' ∈ Π) V^{π'} ≻_P V^π} Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

70 Pareto Sets Definition The Pareto front is the set of all policies that are not Pareto-dominated: PF(Π) = {π : π ∈ Π ∧ ¬∃(π' ∈ Π) V^{π'} ≻_P V^π} Definition A Pareto coverage set is a subset of PF(Π) such that, for every π' ∈ Π, it contains a policy that either dominates π' or has equal value to π': PCS(Π) ⊆ PF(Π) ∧ ∀(π' ∈ Π)(∃π)( π ∈ PCS(Π) ∧ (V^π ≻_P V^{π'} ∨ V^π = V^{π'}) ) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

71 Visualization (figures: Objective Space and Weight Space) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

72 Visualization (figures: Objective Space and Weight Space) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

73 Problem Taxonomy (table as on slide 34) Example: radiation vs. chemotherapy (again) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

74 Problem Taxonomy (table as on slide 34) Example: radiation vs. chemotherapy (again) Note: the only setting that requires a Pareto front! Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

75 Problem Taxonomy (table as on slide 34) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

76 Mixture Policies A CCS(Π_DS) is also a CCS(Π) but not necessarily a PCS(Π) But a PCS(Π) can be made by mixing policies in a CCS(Π_DS) (figure: objective space) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

77 Problem Taxonomy (table as on slide 34) Example: studying vs. networking (again) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

78 Part 2: Methods and Applications Convex Coverage Set Planning Methods Inner Loop: Convex Hull Value Iteration Outer Loop: Optimistic Linear Support Pareto Coverage Set Planning Methods Inner loop (non-stationary): Pareto-Q Outer loop issues Beyond the Taxonomy Applications Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

79 Taxonomy (table as on slide 34) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

80 Taxonomy (table as on slide 34) Known transition and reward functions → planning Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

81 Taxonomy (table as on slide 34) Known transition and reward functions → planning Unknown transition and reward functions → learning Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

82 Background: Value Iteration Initial value estimate V_0(s) Apply Bellman backups until convergence: V_{k+1}(s) ← max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ] Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

83 Background: Value Iteration Initial value estimate V_0(s) Apply Bellman backups until convergence: V_{k+1}(s) ← max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ] Can also be written: V_{k+1}(s) ← max_a Q_{k+1}(s, a), Q_{k+1}(s, a) ← Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ] Optimal policy is easy to retrieve from the Q-table Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
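A compact sketch of tabular value iteration as written above (the dict-based MDP encoding is an assumption for illustration, not the tutorial's code):

```python
def value_iteration(S, A, T, R, gamma, tol=1e-8):
    """Tabular VI. T[s][a] is a list of (prob, s2) pairs; R[s][a][s2] is a scalar."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            # Bellman backup: max over actions of expected reward plus discounted value.
            Q = {a: sum(p * (R[s][a][s2] + gamma * V[s2]) for p, s2 in T[s][a])
                 for a in A}
            v_new = max(Q.values())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

# One-state, one-action MDP with reward 1: V converges to 1 / (1 - gamma) = 2.
print(value_iteration(['s'], ['a'], {'s': {'a': [(1.0, 's')]}},
                      {'s': {'a': {'s': 1.0}}}, gamma=0.5))
```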

84 Taxonomy (table as on slide 34) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

85 Scalarize MOMDP + Value Iteration For known w: V_w^π = w · E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ] = E[ Σ_{k=0}^∞ γ^k (w · r_{t+k+1}) ] Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

86 Scalarize MOMDP + Value Iteration For known w: V_w^π = w · E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ] = E[ Σ_{k=0}^∞ γ^k (w · r_{t+k+1}) ] Scalarize the reward function of the MOMDP: R_w = w · R Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

87 Scalarize MOMDP + Value Iteration For known w: V_w^π = w · E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ] = E[ Σ_{k=0}^∞ γ^k (w · r_{t+k+1}) ] Scalarize the reward function of the MOMDP: R_w = w · R Apply standard VI Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

88 Scalarize MOMDP + Value Iteration For known w: V_w^π = w · E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ] = E[ Σ_{k=0}^∞ γ^k (w · r_{t+k+1}) ] Scalarize the reward function of the MOMDP: R_w = w · R Apply standard VI Does not return multi-objective value Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

89 Scalarized Value Iteration Adapt the Bellman backup: V_{k+1}(s) ← Q_{k+1}(s, a*) with a* = arg max_a w · Q_{k+1}(s, a), Q_{k+1}(s, a) ← Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ] Returns multi-objective value. Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
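A sketch of the scalarized backup above, keeping vector-valued estimates but selecting the greedy action by w · Q (illustrative encoding; a fixed number of sweeps stands in for a proper convergence test):

```python
import numpy as np

def scalarized_vi(S, A, T, R, w, gamma, iters=1000):
    """Scalarized VI: the backup picks the action maximizing w . Q(s, a)
    but stores the full multi-objective value vector."""
    V = {s: np.zeros(len(w)) for s in S}
    for _ in range(iters):
        V_new = {}
        for s in S:
            Q = {a: sum(p * (np.asarray(R[s][a][s2]) + gamma * V[s2])
                        for p, s2 in T[s][a])
                 for a in A}
            best = max(A, key=lambda a: float(np.dot(w, Q[a])))
            V_new[s] = Q[best]
        V = V_new
    return V

# Two-action, one-state MOMDP; with w = (0.75, 0.25) the backup always picks a1.
T = {'s': {'a1': [(1.0, 's')], 'a2': [(1.0, 's')]}}
R = {'s': {'a1': {'s': (2.0, 0.0)}, 'a2': {'s': (0.0, 2.0)}}}
print(scalarized_vi(['s'], ['a1', 'a2'], T, R,
                    w=np.array([0.75, 0.25]), gamma=0.5))   # {'s': array([4., 0.])}
```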

90 Taxonomy (table as on slide 34) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

91 Inner versus Outer Loop (diagram: a single-objective method, an MO inner loop, and an MO outer loop) Inner loop Adapting operators of a single-objective method (e.g., value iteration) Series of multi-objective operations (e.g., Bellman backups) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

92 Inner versus Outer Loop (diagram: a single-objective method, an MO inner loop, and an MO outer loop) Inner loop Adapting operators of a single-objective method (e.g., value iteration) Series of multi-objective operations (e.g., Bellman backups) Outer loop Single-objective method as a subroutine Series of single-objective problems Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

93 Inner Loop: Convex Hull Value Iteration Barrett & Narayanan (2008) Idea: do the backup for all w in parallel New backup operators must handle sets of value vectors. At each backup: generate all value vectors for each s, a-pair prune away those that are not optimal for any w Only need deterministic stationary policies Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

94 Inner Loop: Convex Hull Value Iteration Initial set of value vectors, e.g., V_0(s) = {(0, 0)} All possible value vectors: Q_{k+1}(s, a) ← ⊕_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ] where u + V = {u + v : v ∈ V}, and U ⊕ V = {u + v : u ∈ U ∧ v ∈ V} Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

95 Inner Loop: Convex Hull Value Iteration Initial set of value vectors, e.g., V_0(s) = {(0, 0)} All possible value vectors: Q_{k+1}(s, a) ← ⊕_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ] where u + V = {u + v : v ∈ V}, and U ⊕ V = {u + v : u ∈ U ∧ v ∈ V} Prune value vectors: V_{k+1}(s) ← CPrune( ∪_a Q_{k+1}(s, a) ) CPrune uses linear programs (e.g., Roijers et al. (2015)) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
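A rough sketch of one CHVI sweep under the assumptions above: the set-valued backup is a cross-sum over successor states, and CPrune is implemented here with one linear program per candidate vector via scipy.optimize.linprog. The function names are illustrative, this is not the original CHVI code, and this simple LP formulation may also keep weakly optimal vectors.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def c_prune(vectors, eps=1e-9):
    """Keep only vectors that are optimal for some weight w on the simplex."""
    kept = []
    for i, u in enumerate(vectors):
        others = [v for j, v in enumerate(vectors) if j != i]
        if not others:
            kept.append(u)
            continue
        n = len(u)
        # Variables [w_1..w_n, d]; maximize d s.t. w.(u - v) >= d for all v.
        c = np.zeros(n + 1); c[-1] = -1.0
        A_ub = np.array([np.append(-(np.asarray(u) - np.asarray(v)), 1.0) for v in others])
        A_eq = np.array([np.append(np.ones(n), 0.0)])       # weights sum to 1
        res = linprog(c, A_ub=A_ub, b_ub=np.zeros(len(others)),
                      A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, 1)] * n + [(None, None)])
        if res.success and -res.fun >= -eps:                 # best achievable d
            kept.append(u)
    return kept

def chvi_backup(V, S, A, T, R, gamma):
    """One CHVI sweep: set-valued Bellman backup followed by CPrune."""
    V_new = {}
    for s in S:
        candidates = []
        for a in A:
            # Cross-sum over successor states of T(s,a,s') [R + gamma * V_k(s')].
            per_succ = [[p * (np.asarray(R[s][a][s2]) + gamma * np.asarray(v))
                         for v in V[s2]] for p, s2 in T[s][a]]
            candidates += [sum(combo) for combo in itertools.product(*per_succ)]
        V_new[s] = c_prune(candidates)
    return V_new
```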

96 CHVI Example Extremely simple MOMDP: 1 state: s; 2 actions: a_1 and a_2 Deterministic transitions Deterministic rewards: R(s, a_1, s) = (2, 0), R(s, a_2, s) = (0, 2) γ = 0.5 V_0(s) = {(0, 0)} (figure: weight space) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

97 CHVI Example Deterministic rewards: R(s, a_1, s) = (2, 0), R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 1: V_0(s) = {(0, 0)} Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

98 CHVI Example Deterministic rewards: R(s, a_1, s) = (2, 0), R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 1: V_0(s) = {(0, 0)} Q_1(s, a_1) = {(2, 0)} Q_1(s, a_2) = {(0, 2)} Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

99 CHVI Example Deterministic rewards: R(s, a_1, s) = (2, 0), R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 1: V_0(s) = {(0, 0)} Q_1(s, a_1) = {(2, 0)} Q_1(s, a_2) = {(0, 2)} V_1(s) = CPrune( ∪_a Q_1(s, a) ) = {(2, 0), (0, 2)} (figure: weight space) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

100 CHVI Example Deterministic rewards: R(s, a_1, s) = (2, 0), R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 2: V_1(s) = {(2, 0), (0, 2)} Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

101 CHVI Example Deterministic rewards: R(s, a_1, s) = (2, 0), R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 2: V_1(s) = {(2, 0), (0, 2)} Q_2(s, a_1) = {(3, 0), (2, 1)} Q_2(s, a_2) = {(1, 2), (0, 3)} Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

102 CHVI Example Deterministic rewards: R(s, a_1, s) = (2, 0), R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 2: V_1(s) = {(2, 0), (0, 2)} Q_2(s, a_1) = {(3, 0), (2, 1)} Q_2(s, a_2) = {(1, 2), (0, 3)} V_2(s) = CPrune({(3, 0), (2, 1), (1, 2), (0, 3)}) (figure: weight space) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

103 CHVI Example Deterministic rewards: R(s, a_1, s) = (2, 0), R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 2: V_1(s) = {(2, 0), (0, 2)} Q_2(s, a_1) = {(3, 0), (2, 1)} Q_2(s, a_2) = {(1, 2), (0, 3)} V_2(s) = {(3, 0), (0, 3)} (figure: weight space) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

104 CHVI Example Deterministic rewards: R(s, a_1, s) = (2, 0), R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 3: V_2(s) = {(3, 0), (0, 3)} Q_3(s, a_1) = {(3.5, 0), (2, 1.5)} Q_3(s, a_2) = {(1.5, 2), (0, 3.5)} V_3(s) = CPrune({(3.5, 0), (2, 1.5), (1.5, 2), (0, 3.5)}) = {(3.5, 0), (0, 3.5)} (figure: weight space) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

105 Convex Hull Value Iteration CPrune retains at least one optimal vector for each w Therefore, the V_w that would have been computed by scalarized VI is kept CHVI does not retain excess value vectors Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

106 Convex Hull Value Iteration CPrune retains at least one optimal vector for each w Therefore, the V_w that would have been computed by scalarized VI is kept CHVI does not retain excess value vectors CHVI generates a lot of excess value vectors Removal with linear programs (CPrune) is expensive Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

107 Outer Loop (diagram: a single-objective method, an MO inner loop, and an MO outer loop) Repeatedly calls a single-objective solver Generic multi-objective method: multi-objective coordination graphs multi-objective (multi-agent) MDPs multi-objective partially observable MDPs Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

108 Outer Loop: Optimistic Linear Support Optimistic linear support (OLS) adapts and improves linear support for POMDPs (Cheng (1988)) Solves scalarized instances for specific w Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

109 Outer Loop: Optimistic Linear Support Optimistic linear support (OLS) adapts and improves linear support for POMDPs (Cheng (1988)) Solves scalarized instances for specific w Terminates after checking only a finite number of weights Returns exact CCS Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

110 Linear Support (figure: scalarized value V_w as a function of the weight w_1) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

111 Linear Support (figure: scalarized value V_w as a function of the weight w_1) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

112 Linear Support (figure: scalarized value V_w as a function of the weight w_1) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

113 Linear Support (figure: scalarized value V_w as a function of the weight w_1) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

114 Optimistic Linear Support (figure: weight space with value vectors (1,8), (7,2), (5,6), a corner weight w_c, and maximal possible improvement Δ) Priority queue, Q, for corner weights Maximal possible improvement as priority Stop when Δ < ε Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

115 Optimistic Linear Support Solving the scalarized instances exactly is not always possible ε-approximate solver Produces an ε-CCS Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

116 Reusing Value Functions Observation: when weights w are close, so are the optimal value functions V Hot start with V(s) from a close w Optimistic Linear Support with Alpha Reuse (OLSAR) (plot: runtime (ms) against ε for RS, RAR, OLS+, and OLSAR) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

117 Comparing Inner and Outer Loop OLS (outer loop) advantages Any (cooperative) multi-objective decision problem Any single-objective / scalarized subroutine Inherits quality guarantees Faster for small and medium numbers of objectives Inner loop faster for large numbers of objectives Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

118 Taxonomy (table as on slide 34) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

119 Inner Loop: Pareto-Q Similar to CHVI Different pruning operator Pairwise comparisons: V(s) ≻_P V'(s) Comparisons cheaper, but many more vectors Converges to the correct Pareto coverage set (White (1982)) Executing a policy is no longer trivial (Van Moffaert & Nowé (2014)) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

120 Inner Loop: Pareto-Q Compute all possible vectors: Q_{k+1}(s, a) ← ⊕_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ] where u + V = {u + v : v ∈ V}, U ⊕ V = {u + v : u ∈ U ∧ v ∈ V} Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

121 Inner Loop: Pareto-Q Compute all possible vectors: Q_{k+1}(s, a) ← ⊕_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ] where u + V = {u + v : v ∈ V}, U ⊕ V = {u + v : u ∈ U ∧ v ∈ V} Take the union across a Prune Pareto-dominated vectors: V_{k+1}(s) ← PPrune( ∪_a Q_{k+1}(s, a) ) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
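The same set-valued backup with Pareto pruning instead of linear-scalarization pruning; a self-contained sketch with illustrative names:

```python
import itertools
import numpy as np

def p_prune(vectors):
    """Remove Pareto-dominated vectors (and keep one copy of duplicates)."""
    kept = []
    for u in vectors:
        u = np.asarray(u, dtype=float)
        if any(np.all(np.asarray(v) >= u) and np.any(np.asarray(v) > u) for v in vectors):
            continue                                   # u is Pareto-dominated
        if not any(np.array_equal(u, v) for v in kept):
            kept.append(u)
    return kept

def pareto_q_backup(V, S, A, T, R, gamma):
    """One Pareto-Q sweep: same set-based backup as CHVI, but pruning
    with Pareto dominance instead of linear-scalarization optimality."""
    V_new = {}
    for s in S:
        candidates = []
        for a in A:
            per_succ = [[p * (np.asarray(R[s][a][s2]) + gamma * np.asarray(v))
                         for v in V[s2]] for p, s2 in T[s][a]]
            candidates += [sum(combo) for combo in itertools.product(*per_succ)]
        V_new[s] = p_prune(candidates)
    return V_new
```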

122 Pareto-Q Example Extremely simple MOMDP: 1 state: s; 2 actions: a_1 and a_2 Deterministic rewards: R(s, a_1, s) = (2, 0), R(s, a_2, s) = (0, 2) γ = 0.5 V_0(s) = {(0, 0)} (figure: objective space) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

123 Pareto-Q Example Deterministic rewards: R(s, a_1, s) = (2, 0), R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 1: V_0(s) = {(0, 0)} Q_1(s, a_1) = {(2, 0)} Q_1(s, a_2) = {(0, 2)} V_1(s) = PPrune( ∪_a Q_1(s, a) ) = {(2, 0), (0, 2)} (figure: objective space) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

124 Pareto-Q Example Deterministic rewards: R(s, a_1, s) = (2, 0), R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 2: V_1(s) = {(2, 0), (0, 2)} Q_2(s, a_1) = {(3, 0), (2, 1)} Q_2(s, a_2) = {(1, 2), (0, 3)} V_2(s) = PPrune({(3, 0), (2, 1), (1, 2), (0, 3)}) (figure: objective space) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

125 Pareto-Q Example Deterministic rewards: R(s, a_1, s) = (2, 0), R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 3: V_2(s) = {(3, 0), (2, 1), (1, 2), (0, 3)} Q_3(s, a_1) = {(3.5, 0), (3, 0.5), (2.5, 1), (2, 1.5)} Q_3(s, a_2) = {(1.5, 2), (1, 2.5), (0.5, 3), (0, 3.5)} V_3(s) = PPrune({(3.5, 0), (3, 0.5), (2.5, 1), (2, 1.5), (1.5, 2), (1, 2.5), (0.5, 3), (0, 3.5)}) (figure: objective space) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

126 Inner Loop: Pareto-Q PCS size can explode Policies are no longer stationary Cannot read the policy from the Q-table Except for the first action Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

127 Inner Loop: Pareto-Q PCS size can explode Policies are no longer stationary Cannot read the policy from the Q-table Except for the first action Track a policy during execution (Van Moffaert & Nowé (2014)) For transitions s, a → s': From Q_{t=0}(s, a) subtract R(s, a) Correct for the discount factor → V_{t=1}(s') Find V_{t=1}(s') in the Q-tables for s' For stochastic transitions, see Kristof Van Moffaert's PhD thesis Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
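A sketch of that tracking procedure for deterministic transitions (illustrative names; the stochastic case needs the more involved bookkeeping referenced above):

```python
import numpy as np

def next_target(target, reward, gamma):
    """After executing (s, a) with chosen target vector Q_t(s, a),
    subtract the observed reward and undo the discount to obtain the
    value vector to look for in the next state's Q-sets."""
    return (np.asarray(target, dtype=float) - np.asarray(reward, dtype=float)) / gamma

def find_next_choice(Q_sets_next, target, tol=1e-6):
    """Q_sets_next: dict mapping action -> list of value vectors at s'.
    Return an (action, vector) pair matching the propagated target."""
    for a, vecs in Q_sets_next.items():
        for v in vecs:
            if np.allclose(v, target, atol=tol):
                return a, v
    raise ValueError("target vector not found; deterministic-transition sketch only")
```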

128 Outer Loop? (diagram: a single-objective method, an MO inner loop, and an MO outer loop) Outer loop very difficult: V_w^π = f( E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ], w ) ≠ E[ Σ_{k=0}^∞ γ^k f(r_{t+k+1}, w) ] Maximization does not do the trick! Heuristic with non-linear f (Van Moffaert, Drugan, Nowé (2013)) Not guaranteed to find the optimal policy, or to converge Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

129 Taxonomy (table as on slide 34) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

130 Part 2: Methods and Applications Convex Coverage Set Planning Methods Inner Loop: Convex Hull Value Iteration Outer Loop: Optimistic Linear Support Pareto Coverage Set Planning Methods Inner loop (non-stationary): Pareto-Q Outer loop issues Beyond the Taxonomy Applications Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

131 Non-linear f and stationary policies Scalarized returns are not additive. π : S → A CON-MODP (Wiering & De Jong (2007)) Does multi-objective Bellman backups, like Pareto-Q Additional CONsistency operator to force stationarity Guaranteed to converge to the correct PCS Computationally expensive Heuristic method: Pareto Local Policy Search (Kooijman et al. (2015)) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

132 Learning Scenarios (diagram: learning phase: dataset → Learning Algorithm → coverage set; selection phase: Scalarization with weights → selected policy; execution phase) Model-based learning (Wiering, Withagen & Drugan (2014)) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

133 Learning Methods Q-Learning schema: Q_new(s, a) ← (1 - α) Q_old(s, a) + α ( R(s, a, s') + γ V_old(s') ) CCS: Parallel Q-learning (Hiraoka, Yoshida & Mashima (2009)) Det. Non-stat. PCS: Pareto-Q Learning (Van Moffaert & Nowé (2014)) Det. Stat. PCS: schema not applicable Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

134 Learning Methods Q-Learning schema: Q_new(s, a) ← (1 - α) Q_old(s, a) + α ( R(s, a, s') + γ V_old(s') ) CCS: Parallel Q-learning (Hiraoka, Yoshida & Mashima (2009)) Det. Non-stat. PCS: Pareto-Q Learning (Van Moffaert & Nowé (2014)) Det. Stat. PCS: schema not applicable Model-based learning for Det. Stat. PCS (Wiering, Withagen & Drugan (2014)) Various evolutionary approaches (e.g., Handa (2009)) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
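For the known-weights (single-policy) case, the schema above can be instantiated with vector-valued Q estimates and a linearly scalarized greedy bootstrap; a hedged sketch, not any of the cited algorithms:

```python
import numpy as np

def q_update(Q, s, a, r_vec, s2, w, alpha, gamma, actions):
    """One tabular update of the schema above with vector-valued rewards.
    Q: dict mapping (state, action) -> value vector, assumed initialized
    for all pairs. The bootstrap action is greedy w.r.t. w . Q(s', .)."""
    a2 = max(actions, key=lambda b: float(np.dot(w, Q[(s2, b)])))
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (np.asarray(r_vec) + gamma * Q[(s2, a2)])
    return Q
```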

135 Learning Scenarios (diagram: Environment → states, actions, rewards → Learning Algorithm → partial coverage set → Policy Selector, which receives weights and returns the selected policy) Dynamic preferences (Natarajan & Tadepalli (2005)) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

136 Other Decision Problems Taxonomy applies to cooperative decision problems Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

137 Other Decision Problems Taxonomy applies to cooperative decision problems Multi-objective coordination graphs / constraint optimization problems (Rollón (2008); Roijers, Whiteson & Oliehoek (2015a); Wilson, Abdul, Marinescu (2015) ) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

138 Other Decision Problems Taxonomy applies to cooperative decision problems Multi-objective coordination graphs / constraint optimization problems (Rollón (2008); Roijers, Whiteson & Oliehoek (2015a); Wilson, Abdul, Marinescu (2015) ) Multi-objective POMDPs (Soh & Demeris (2011); Roijers, Whiteson & Oliehoek (2015b); Wray & Zilberstein (2015)) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

139 Other Decision Problems Taxonomy applies to cooperative decision problems Multi-objective coordination graphs / constraint optimization problems (Rollón (2008); Roijers, Whiteson & Oliehoek (2015a); Wilson, Abdul, Marinescu (2015) ) Multi-objective POMDPs (Soh & Demeris (2011); Roijers, Whiteson & Oliehoek (2015b); Wray & Zilberstein (2015)) Multi-objective bandit problems (Drugan & Nowé (2013)) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

140 Part 2: Methods and Applications Convex Coverage Set Planning Methods Inner Loop: Convex Hull Value Iteration Outer Loop: Optimistic Linear Support Pareto Coverage Set Planning Methods Inner loop (non-stationary): Pareto-Q Outer loop issues Beyond the Taxonomy Applications Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

141 Medical Applications Treatment planning (Lizotte (2010, 2012)) Maximizing effectiveness of the treatment Minimizing the severity of the side-effects Finite-horizon MOMDPs Deterministic policies Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

142 Medical Applications Anthrax response (Soh & Demiris (2011)) Minimizing loss of life Minimizing number of false alarms Minimizing cost of investigation Partial observability (MOPOMDP) Finite-state controllers Evolutionary method Pareto coverage set Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

143 Medical Applications Control system for wheelchairs (Soh & Demiris (2011)) Maximizing safety Maximizing speed Minimizing power consumption Partial observability (MOPOMDP) Finite-state controllers Evolutionary method Pareto coverage set Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

144 Environmental Applications Environmental control of a building (Kwak et al. (2012)) Minimizing energy consumption Maximizing comfort Reduce energy consumption by 30% w.r.t. old system While maintaining comfort Groups of people with different preferences Smart grids Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

145 Broader Application Probabilistic Planning is Multi-objective Bryce et al. (2007) The expected return is not enough Cost of a plan Probability of success of a plan Non-goal terminal states Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1

146 Closing Consider multiple objectives most problems have them a priori scalarization can be bad Derive your solution set Pareto front often not necessary Promising applications Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1


More information

CS 598 Statistical Reinforcement Learning. Nan Jiang

CS 598 Statistical Reinforcement Learning. Nan Jiang CS 598 Statistical Reinforcement Learning Nan Jiang Overview What s this course about? A grad-level seminar course on theory of RL 3 What s this course about? A grad-level seminar course on theory of RL

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Formal models of interaction Daniel Hennes 27.11.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Taxonomy of domains Models of

More information

16.4 Multiattribute Utility Functions

16.4 Multiattribute Utility Functions 285 Normalized utilities The scale of utilities reaches from the best possible prize u to the worst possible catastrophe u Normalized utilities use a scale with u = 0 and u = 1 Utilities of intermediate

More information

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation CS234: RL Emma Brunskill Winter 2018 Material builds on structure from David SIlver s Lecture 4: Model-Free

More information

CS599 Lecture 1 Introduction To RL

CS599 Lecture 1 Introduction To RL CS599 Lecture 1 Introduction To RL Reinforcement Learning Introduction Learning from rewards Policies Value Functions Rewards Models of the Environment Exploitation vs. Exploration Dynamic Programming

More information

Using first-order logic, formalize the following knowledge:

Using first-order logic, formalize the following knowledge: Probabilistic Artificial Intelligence Final Exam Feb 2, 2016 Time limit: 120 minutes Number of pages: 19 Total points: 100 You can use the back of the pages if you run out of space. Collaboration on the

More information

Decision Theory: Markov Decision Processes

Decision Theory: Markov Decision Processes Decision Theory: Markov Decision Processes CPSC 322 Lecture 33 March 31, 2006 Textbook 12.5 Decision Theory: Markov Decision Processes CPSC 322 Lecture 33, Slide 1 Lecture Overview Recap Rewards and Policies

More information

Reinforcement Learning

Reinforcement Learning 1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision

More information

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018 Machine learning for image classification Lecture 14: Reinforcement learning May 9, 2018 Page 3 Outline Motivation Introduction to reinforcement learning (RL) Value function based methods (Q-learning)

More information

Notes on Reinforcement Learning

Notes on Reinforcement Learning 1 Introduction Notes on Reinforcement Learning Paulo Eduardo Rauber 2014 Reinforcement learning is the study of agents that act in an environment with the goal of maximizing cumulative reward signals.

More information

Prioritized Sweeping Converges to the Optimal Value Function

Prioritized Sweeping Converges to the Optimal Value Function Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Dipendra Misra Cornell University dkm@cs.cornell.edu https://dipendramisra.wordpress.com/ Task Grasp the green cup. Output: Sequence of controller actions Setup from Lenz et. al.

More information

CS 4100 // artificial intelligence. Recap/midterm review!

CS 4100 // artificial intelligence. Recap/midterm review! CS 4100 // artificial intelligence instructor: byron wallace Recap/midterm review! Attribution: many of these slides are modified versions of those distributed with the UC Berkeley CS188 materials Thanks

More information

Optimizing Memory-Bounded Controllers for Decentralized POMDPs

Optimizing Memory-Bounded Controllers for Decentralized POMDPs Optimizing Memory-Bounded Controllers for Decentralized POMDPs Christopher Amato, Daniel S. Bernstein and Shlomo Zilberstein Department of Computer Science University of Massachusetts Amherst, MA 01003

More information

Reinforcement learning an introduction

Reinforcement learning an introduction Reinforcement learning an introduction Prof. Dr. Ann Nowé Computational Modeling Group AIlab ai.vub.ac.be November 2013 Reinforcement Learning What is it? Learning from interaction Learning about, from,

More information

A Decentralized Approach to Multi-agent Planning in the Presence of Constraints and Uncertainty

A Decentralized Approach to Multi-agent Planning in the Presence of Constraints and Uncertainty 2011 IEEE International Conference on Robotics and Automation Shanghai International Conference Center May 9-13, 2011, Shanghai, China A Decentralized Approach to Multi-agent Planning in the Presence of

More information

Hidden Markov Models (HMM) and Support Vector Machine (SVM)

Hidden Markov Models (HMM) and Support Vector Machine (SVM) Hidden Markov Models (HMM) and Support Vector Machine (SVM) Professor Joongheon Kim School of Computer Science and Engineering, Chung-Ang University, Seoul, Republic of Korea 1 Hidden Markov Models (HMM)

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Reinforcement learning Daniel Hennes 4.12.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Reinforcement learning Model based and

More information

Point-Based Value Iteration for Constrained POMDPs

Point-Based Value Iteration for Constrained POMDPs Point-Based Value Iteration for Constrained POMDPs Dongho Kim Jaesong Lee Kee-Eung Kim Department of Computer Science Pascal Poupart School of Computer Science IJCAI-2011 2011. 7. 22. Motivation goals

More information

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning CSE 473: Artificial Intelligence Reinforcement Learning Dan Weld Today s Outline Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Many slides

More information

CS788 Dialogue Management Systems Lecture #2: Markov Decision Processes

CS788 Dialogue Management Systems Lecture #2: Markov Decision Processes CS788 Dialogue Management Systems Lecture #2: Markov Decision Processes Kee-Eung Kim KAIST EECS Department Computer Science Division Markov Decision Processes (MDPs) A popular model for sequential decision

More information

6 Reinforcement Learning

6 Reinforcement Learning 6 Reinforcement Learning As discussed above, a basic form of supervised learning is function approximation, relating input vectors to output vectors, or, more generally, finding density functions p(y,

More information

Optimal Control of Partiality Observable Markov. Processes over a Finite Horizon

Optimal Control of Partiality Observable Markov. Processes over a Finite Horizon Optimal Control of Partiality Observable Markov Processes over a Finite Horizon Report by Jalal Arabneydi 04/11/2012 Taken from Control of Partiality Observable Markov Processes over a finite Horizon by

More information

Temporal Difference Learning & Policy Iteration

Temporal Difference Learning & Policy Iteration Temporal Difference Learning & Policy Iteration Advanced Topics in Reinforcement Learning Seminar WS 15/16 ±0 ±0 +1 by Tobias Joppen 03.11.2015 Fachbereich Informatik Knowledge Engineering Group Prof.

More information

Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm

Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 2778 mgl@cs.duke.edu

More information

Chapter 16 Planning Based on Markov Decision Processes

Chapter 16 Planning Based on Markov Decision Processes Lecture slides for Automated Planning: Theory and Practice Chapter 16 Planning Based on Markov Decision Processes Dana S. Nau University of Maryland 12:48 PM February 29, 2012 1 Motivation c a b Until

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Based on [Kaelbling et al., 1996, Bertsekas, 2000] Bert Kappen Reinforcement learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error

More information

Markov Decision Processes (and a small amount of reinforcement learning)

Markov Decision Processes (and a small amount of reinforcement learning) Markov Decision Processes (and a small amount of reinforcement learning) Slides adapted from: Brian Williams, MIT Manuela Veloso, Andrew Moore, Reid Simmons, & Tom Mitchell, CMU Nicholas Roy 16.4/13 Session

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Markov decision process & Dynamic programming Evaluative feedback, value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value

More information

Reinforcement Learning and Control

Reinforcement Learning and Control CS9 Lecture notes Andrew Ng Part XIII Reinforcement Learning and Control We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make

More information

Computing Possibly Optimal Solutions for Multi-Objective Constraint Optimisation with Tradeoffs

Computing Possibly Optimal Solutions for Multi-Objective Constraint Optimisation with Tradeoffs Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 215) Computing Possibly Optimal Solutions for Multi-Objective Constraint Optimisation with Tradeoffs Nic

More information

An Adaptive Clustering Method for Model-free Reinforcement Learning

An Adaptive Clustering Method for Model-free Reinforcement Learning An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at

More information

Sequential decision making under uncertainty. Department of Computer Science, Czech Technical University in Prague

Sequential decision making under uncertainty. Department of Computer Science, Czech Technical University in Prague Sequential decision making under uncertainty Jiří Kléma Department of Computer Science, Czech Technical University in Prague /doku.php/courses/a4b33zui/start pagenda Previous lecture: individual rational

More information

Lecture 1: March 7, 2018

Lecture 1: March 7, 2018 Reinforcement Learning Spring Semester, 2017/8 Lecture 1: March 7, 2018 Lecturer: Yishay Mansour Scribe: ym DISCLAIMER: Based on Learning and Planning in Dynamical Systems by Shie Mannor c, all rights

More information

Planning Under Uncertainty II

Planning Under Uncertainty II Planning Under Uncertainty II Intelligent Robotics 2014/15 Bruno Lacerda Announcement No class next Monday - 17/11/2014 2 Previous Lecture Approach to cope with uncertainty on outcome of actions Markov

More information

RL 14: POMDPs continued

RL 14: POMDPs continued RL 14: POMDPs continued Michael Herrmann University of Edinburgh, School of Informatics 06/03/2015 POMDPs: Points to remember Belief states are probability distributions over states Even if computationally

More information

Multiagent Value Iteration in Markov Games

Multiagent Value Iteration in Markov Games Multiagent Value Iteration in Markov Games Amy Greenwald Brown University with Michael Littman and Martin Zinkevich Stony Brook Game Theory Festival July 21, 2005 Agenda Theorem Value iteration converges

More information

Lecture 3: Markov Decision Processes

Lecture 3: Markov Decision Processes Lecture 3: Markov Decision Processes Joseph Modayil 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs Markov Processes Introduction Introduction to MDPs Markov

More information

Lecture 25: Learning 4. Victor R. Lesser. CMPSCI 683 Fall 2010

Lecture 25: Learning 4. Victor R. Lesser. CMPSCI 683 Fall 2010 Lecture 25: Learning 4 Victor R. Lesser CMPSCI 683 Fall 2010 Final Exam Information Final EXAM on Th 12/16 at 4:00pm in Lederle Grad Res Ctr Rm A301 2 Hours but obviously you can leave early! Open Book

More information

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels? Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity

More information

Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan

Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan Some slides borrowed from Peter Bodik and David Silver Course progress Learning

More information

Lecture 3: The Reinforcement Learning Problem

Lecture 3: The Reinforcement Learning Problem Lecture 3: The Reinforcement Learning Problem Objectives of this lecture: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information

Machine Learning I Reinforcement Learning

Machine Learning I Reinforcement Learning Machine Learning I Reinforcement Learning Thomas Rückstieß Technische Universität München December 17/18, 2009 Literature Book: Reinforcement Learning: An Introduction Sutton & Barto (free online version:

More information

Reinforcement Learning

Reinforcement Learning CS7/CS7 Fall 005 Supervised Learning: Training examples: (x,y) Direct feedback y for each input x Sequence of decisions with eventual feedback No teacher that critiques individual actions Learn to act

More information

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it

More information

Real Time Value Iteration and the State-Action Value Function

Real Time Value Iteration and the State-Action Value Function MS&E338 Reinforcement Learning Lecture 3-4/9/18 Real Time Value Iteration and the State-Action Value Function Lecturer: Ben Van Roy Scribe: Apoorva Sharma and Tong Mu 1 Review Last time we left off discussing

More information

Sequential Decision Problems

Sequential Decision Problems Sequential Decision Problems Michael A. Goodrich November 10, 2006 If I make changes to these notes after they are posted and if these changes are important (beyond cosmetic), the changes will highlighted

More information

Efficient Maximization in Solving POMDPs

Efficient Maximization in Solving POMDPs Efficient Maximization in Solving POMDPs Zhengzhu Feng Computer Science Department University of Massachusetts Amherst, MA 01003 fengzz@cs.umass.edu Shlomo Zilberstein Computer Science Department University

More information

Reinforcement Learning. Yishay Mansour Tel-Aviv University

Reinforcement Learning. Yishay Mansour Tel-Aviv University Reinforcement Learning Yishay Mansour Tel-Aviv University 1 Reinforcement Learning: Course Information Classes: Wednesday Lecture 10-13 Yishay Mansour Recitations:14-15/15-16 Eliya Nachmani Adam Polyak

More information

The exam is closed book, closed calculator, and closed notes except your one-page crib sheet.

The exam is closed book, closed calculator, and closed notes except your one-page crib sheet. CS 188 Spring 2017 Introduction to Artificial Intelligence Midterm V2 You have approximately 80 minutes. The exam is closed book, closed calculator, and closed notes except your one-page crib sheet. Mark

More information

On Prediction and Planning in Partially Observable Markov Decision Processes with Large Observation Sets

On Prediction and Planning in Partially Observable Markov Decision Processes with Large Observation Sets On Prediction and Planning in Partially Observable Markov Decision Processes with Large Observation Sets Pablo Samuel Castro pcastr@cs.mcgill.ca McGill University Joint work with: Doina Precup and Prakash

More information

CS 188 Introduction to Fall 2007 Artificial Intelligence Midterm

CS 188 Introduction to Fall 2007 Artificial Intelligence Midterm NAME: SID#: Login: Sec: 1 CS 188 Introduction to Fall 2007 Artificial Intelligence Midterm You have 80 minutes. The exam is closed book, closed notes except a one-page crib sheet, basic calculators only.

More information

Learning to Coordinate with Coordination Graphs in Repeated Single-Stage Multi-Agent Decision Problems

Learning to Coordinate with Coordination Graphs in Repeated Single-Stage Multi-Agent Decision Problems Learning to Coordinate with Coordination Graphs in Repeated Single-Stage Multi-Agent Decision Problems Eugenio Bargiacchi Timothy Verstraeten Diederik M. Roijers 2 Ann Nowé Hado van Hasselt 3 Abstract

More information

16.410/413 Principles of Autonomy and Decision Making

16.410/413 Principles of Autonomy and Decision Making 16.410/413 Principles of Autonomy and Decision Making Lecture 23: Markov Decision Processes Policy Iteration Emilio Frazzoli Aeronautics and Astronautics Massachusetts Institute of Technology December

More information

Chapter 3: The Reinforcement Learning Problem

Chapter 3: The Reinforcement Learning Problem Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information

Towards Faster Planning with Continuous Resources in Stochastic Domains

Towards Faster Planning with Continuous Resources in Stochastic Domains Towards Faster Planning with Continuous Resources in Stochastic Domains Janusz Marecki and Milind Tambe Computer Science Department University of Southern California 941 W 37th Place, Los Angeles, CA 989

More information

(Deep) Reinforcement Learning

(Deep) Reinforcement Learning Martin Matyášek Artificial Intelligence Center Czech Technical University in Prague October 27, 2016 Martin Matyášek VPD, 2016 1 / 17 Reinforcement Learning in a picture R. S. Sutton and A. G. Barto 2015

More information

Reinforcement Learning and Deep Reinforcement Learning

Reinforcement Learning and Deep Reinforcement Learning Reinforcement Learning and Deep Reinforcement Learning Ashis Kumer Biswas, Ph.D. ashis.biswas@ucdenver.edu Deep Learning November 5, 2018 1 / 64 Outlines 1 Principles of Reinforcement Learning 2 The Q

More information

Course basics. CSE 190: Reinforcement Learning: An Introduction. Last Time. Course goals. The website for the class is linked off my homepage.

Course basics. CSE 190: Reinforcement Learning: An Introduction. Last Time. Course goals. The website for the class is linked off my homepage. Course basics CSE 190: Reinforcement Learning: An Introduction The website for the class is linked off my homepage. Grades will be based on programming assignments, homeworks, and class participation.

More information

Multi Objective Optimization

Multi Objective Optimization Multi Objective Optimization Handout November 4, 2011 (A good reference for this material is the book multi-objective optimization by K. Deb) 1 Multiple Objective Optimization So far we have dealt with

More information

AM 121: Intro to Optimization Models and Methods: Fall 2018

AM 121: Intro to Optimization Models and Methods: Fall 2018 AM 11: Intro to Optimization Models and Methods: Fall 018 Lecture 18: Markov Decision Processes Yiling Chen Lesson Plan Markov decision processes Policies and value functions Solving: average reward, discounted

More information

Homework 2: MDPs and Search

Homework 2: MDPs and Search Graduate Artificial Intelligence 15-780 Homework 2: MDPs and Search Out on February 15 Due on February 29 Problem 1: MDPs [Felipe, 20pts] Figure 1: MDP for Problem 1. States are represented by circles

More information