Multi-Objective Decision Making
1 Multi-Objective Decision Making
Shimon Whiteson & Diederik M. Roijers
Department of Computer Science, University of Oxford
July 25, 2016
Whiteson & Roijers (UOx), Multi-Objective Decision Making, July 25, 2016
2 Schedule
14:30-15:10: Motivation & Concepts (Shimon)
15:10-15:20: Short Break
15:20-16:00: Motivation & Concepts cont'd (Shimon)
16:00-16:30: Coffee Break
16:30-17:10: Methods & Applications (Diederik)
17:10-17:20: Short Break
17:20-18:00: Methods & Applications cont'd (Diederik)
3 Note
Get the latest version of the slides at:
This tutorial is based on our survey article:
Diederik Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A Survey of Multi-Objective Sequential Decision-Making. Journal of Artificial Intelligence Research, 48:67-113, 2013.
and on Diederik's dissertation:
4 Part 1: Motivation & Concepts
Multi-Objective Motivation
MDPs & MOMDPs
Problem Taxonomy
Solution Concepts
5 Medical Treatment
Chance of being cured, having side effects, or dying
6 Traffic Coordination
Latency, throughput, fairness, environmental impact, etc.
7 Mining Commodities
Gold collected, silver collected
(figure: villages and mines) [Roijers et al. 2013, 2014]
8 Grid World
Getting the treasure, minimising fuel costs
9 Do We Need Multi-Objective Models?
10 Do We Need Multi-Objective Models?
Sutton's Reward Hypothesis: "All of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward)."
Source:
11 Do We Need Multi-Objective Models?
Sutton's Reward Hypothesis: "All of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward)."
Source:
$V : \Pi \to \mathbb{R}$
$V^\pi = \mathbb{E}_\pi[\sum_t r_t]$
$\pi^* = \arg\max_\pi V^\pi$
12 Why Multi-Objective Decision Making?
The weak argument: real-world problems are multi-objective!
$\mathbf{V} : \Pi \to \mathbb{R}^n$
13 Why Multi-Objective Decision Making?
The weak argument: real-world problems are multi-objective!
$\mathbf{V} : \Pi \to \mathbb{R}^n$
Objection: why not just scalarize?
14 Why Multi-Objective Decision Making?
The weak argument: real-world problems are multi-objective!
$\mathbf{V} : \Pi \to \mathbb{R}^n$
Objection: why not just scalarize?
A scalarization function projects the multi-objective value to a scalar: $V^\pi_w = f(\mathbf{V}^\pi, \mathbf{w})$
Linear case: $V^\pi_w = \sum_{i=1}^n w_i V^\pi_i = \mathbf{w} \cdot \mathbf{V}^\pi$
A priori prioritization of the objectives
15 Why Multi-Objective Decision Making?
The weak argument: real-world problems are multi-objective!
$\mathbf{V} : \Pi \to \mathbb{R}^n$
Objection: why not just scalarize?
A scalarization function projects the multi-objective value to a scalar: $V^\pi_w = f(\mathbf{V}^\pi, \mathbf{w})$
Linear case: $V^\pi_w = \sum_{i=1}^n w_i V^\pi_i = \mathbf{w} \cdot \mathbf{V}^\pi$
A priori prioritization of the objectives
The weak argument is necessary but not sufficient
16 Why Multi-Objective Decision Making?
The strong argument: a priori scalarization is sometimes impossible, infeasible, or undesirable
Instead, produce the coverage set of undominated solutions
17 Why Multi-Objective Decision Making?
The strong argument: a priori scalarization is sometimes impossible, infeasible, or undesirable
Instead, produce the coverage set of undominated solutions
Unknown-weights scenario: weights known in the execution phase but not in the planning phase
Example: mining commodities [Roijers et al. 2013]
18 Why Multi-Objective Decision Making?
Decision-support scenario: quantifying priorities is infeasible; choosing between options is easier
Example: medical treatment
19 Why Multi-Objective Decision Making?
Decision-support scenario: quantifying priorities is infeasible; choosing between options is easier
Example: medical treatment
Known-weights scenario: scalarization yields an intractable problem
20 Summary of Motivation
Multi-objective methods are useful because many problems are naturally characterized by multiple objectives and cannot be easily scalarized a priori.
21 Summary of Motivation
Multi-objective methods are useful because many problems are naturally characterized by multiple objectives and cannot be easily scalarized a priori.
The burden of proof rests with the a priori scalarization, not with the multi-objective modeling.
22 Part 1: Motivation & Concepts
Multi-Objective Motivation
MDPs & MOMDPs
Problem Taxonomy
Solution Concepts
23 Markov Decision Process (MDP)
A single-objective MDP is a tuple $\langle S, A, T, R, \mu, \gamma \rangle$ where:
$S$ is a finite set of states
$A$ is a finite set of actions
$T : S \times A \times S \to [0, 1]$ is a transition function
$R : S \times A \times S \to \mathbb{R}$ is a reward function
$\mu : S \to [0, 1]$ is a probability distribution over initial states
$\gamma \in [0, 1)$ is a discount factor
(figure from Poole & Mackworth, Artificial Intelligence: Foundations of Computational Agents, 2010)
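The tuple above can be encoded directly as plain data. A minimal sketch (the state and action names and the particular data layout are illustrative assumptions, not prescribed by the slides):

```python
# A tiny 2-state MDP encoded as plain Python data (illustrative only).
S = ["s0", "s1"]
A = ["stay", "go"]

# T[(s, a)] maps next state s' to probability T(s, a, s').
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s1": 1.0},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.9, "s1": 0.1},
}

# R[(s, a, s')] is the scalar reward.
R = {
    ("s0", "stay", "s0"): 0.0,
    ("s0", "go", "s1"):   1.0,
    ("s1", "stay", "s1"): 2.0,
    ("s1", "go", "s0"):   0.0,
    ("s1", "go", "s1"):   0.0,
}

mu = {"s0": 1.0}   # initial-state distribution
gamma = 0.9        # discount factor in [0, 1)

# Sanity check: transition probabilities sum to 1 for every (s, a).
assert all(abs(sum(p.values()) - 1.0) < 1e-9 for p in T.values())
```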
24 Returns & Policies
Goal: maximize the expected return, which is typically additive: $R_t = \sum_{k=0}^\infty \gamma^k r_{t+k+1}$
A stationary policy conditions only on the current state: $\pi : S \times A \to [0, 1]$
A deterministic stationary policy maps states directly to actions: $\pi : S \to A$
25 Value Functions in MDPs
A state-independent value function $V^\pi$ specifies the expected return when following $\pi$ from the initial state: $V^\pi = \mathbb{E}[R_0 \mid \pi]$  (1)
A state value function of a policy $\pi$: $V^\pi(s) = \mathbb{E}[R_t \mid \pi, s_t = s]$
The Bellman equation restates this expectation recursively for stationary policies:
$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} T(s, a, s')\,[R(s, a, s') + \gamma V^\pi(s')]$
26 Optimality in MDPs
Theorem: For any additive infinite-horizon single-objective MDP, there exists a deterministic stationary optimal policy [Howard 1960]
All optimal policies share the same optimal value function:
$V^*(s) = \max_\pi V^\pi(s) = \max_a \sum_{s'} T(s, a, s')\,[R(s, a, s') + \gamma V^*(s')]$
Extract the optimal policy using local action selection:
$\pi^*(s) = \arg\max_a \sum_{s'} T(s, a, s')\,[R(s, a, s') + \gamma V^*(s')]$
27 Multi-Objective MDP (MOMDP)
Vector-valued reward and value:
$\mathbf{R} : S \times A \times S \to \mathbb{R}^n$
$\mathbf{V}^\pi = \mathbb{E}[\sum_{k=0}^\infty \gamma^k \mathbf{r}_{k+1} \mid \pi]$
$\mathbf{V}^\pi(s) = \mathbb{E}[\sum_{k=0}^\infty \gamma^k \mathbf{r}_{t+k+1} \mid \pi, s_t = s]$
$\mathbf{V}^\pi(s)$ imposes only a partial ordering, e.g., $V^\pi_i(s) > V^{\pi'}_i(s)$ but $V^\pi_j(s) < V^{\pi'}_j(s)$
The definition of optimality is no longer clear
28 Part 1: Motivation & Concepts
Multi-Objective Motivation
MDPs & MOMDPs
Problem Taxonomy
Solution Concepts
29 Axiomatic vs. Utility-Based Approach
Axiomatic approach: define the optimal solution set to be the Pareto front
30 Axiomatic vs. Utility-Based Approach
Axiomatic approach: define the optimal solution set to be the Pareto front
Utility-based approach:
Execution phase: select one policy maximizing the scalar utility $V^\pi_w$, where $\mathbf{w}$ may be hidden or implicit
31 Axiomatic vs. Utility-Based Approach
Axiomatic approach: define the optimal solution set to be the Pareto front
Utility-based approach:
Execution phase: select one policy maximizing the scalar utility $V^\pi_w$, where $\mathbf{w}$ may be hidden or implicit
Planning phase: find a set of policies containing an optimal solution for each possible $\mathbf{w}$; if $\mathbf{w}$ is unknown, the size of this set is generally > 1
32 Axiomatic vs. Utility-Based Approach
Axiomatic approach: define the optimal solution set to be the Pareto front
Utility-based approach:
Execution phase: select one policy maximizing the scalar utility $V^\pi_w$, where $\mathbf{w}$ may be hidden or implicit
Planning phase: find a set of policies containing an optimal solution for each possible $\mathbf{w}$; if $\mathbf{w}$ is unknown, the size of this set is generally > 1
Deduce the optimal solution set from three factors:
1 Multi-objective scenario
2 Properties of the scalarization function
3 Allowable policies
33 Three Factors
1 Multi-objective scenario
Known weights: single policy
Unknown weights or decision support: multiple policies
2 Properties of the scalarization function
Linear
Monotonically increasing
3 Allowable policies
Deterministic
Stochastic
34 Problem Taxonomy
single policy (known weights):
  linear scalarization (deterministic or stochastic policies): one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic nonstationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization (deterministic or stochastic policies): convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic nonstationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
35 Part 1: Motivation & Concepts
Multi-Objective Motivation
MDPs & MOMDPs
Problem Taxonomy
Solution Concepts
36 Problem Taxonomy
(taxonomy table as on slide 34)
37 Linear Scalarization Functions
Computes the inner product of $\mathbf{w}$ and $\mathbf{V}^\pi$: $V^\pi_w = \sum_{i=1}^n w_i V^\pi_i = \mathbf{w} \cdot \mathbf{V}^\pi$, $\mathbf{w} \in \mathbb{R}^n$
$w_i$ quantifies the importance of the $i$-th objective
Simple and intuitive, e.g., when utility translates to money: revenue = #cans $\cdot$ ppc + #bottles $\cdot$ ppb
38 Linear Scalarization Functions
Computes the inner product of $\mathbf{w}$ and $\mathbf{V}^\pi$: $V^\pi_w = \sum_{i=1}^n w_i V^\pi_i = \mathbf{w} \cdot \mathbf{V}^\pi$, $\mathbf{w} \in \mathbb{R}^n$
$w_i$ quantifies the importance of the $i$-th objective
Simple and intuitive, e.g., when utility translates to money: revenue = #cans $\cdot$ ppc + #bottles $\cdot$ ppb
$\mathbf{w}$ is typically constrained to be a convex combination: $\forall i\; w_i \geq 0$, $\sum_i w_i = 1$
utility = #cans $\cdot$ ppc/(ppc + ppb) + #bottles $\cdot$ ppb/(ppc + ppb)
39 Linear Scalarization & Single Policy
No special methods required: just apply $f$ to each reward vector
The inner product distributes over addition, yielding a normal MDP:
$V^\pi_w = \mathbf{w} \cdot \mathbf{V}^\pi = \mathbf{w} \cdot \mathbb{E}[\sum_{k=0}^\infty \gamma^k \mathbf{r}_{t+k+1}] = \mathbb{E}[\sum_{k=0}^\infty \gamma^k (\mathbf{w} \cdot \mathbf{r}_{t+k+1})]$
Apply standard methods to an MDP with $R_w(s, a, s') = \mathbf{w} \cdot \mathbf{R}(s, a, s')$,  (2)
yielding a single deterministic stationary policy
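The identity that licenses this reduction, that the inner product distributes over the discounted sum, can be checked numerically on sampled reward-vector trajectories (a small sketch; the weights and trajectory lengths are arbitrary choices):

```python
import random

random.seed(0)
w = (0.7, 0.3)      # hypothetical weights
gamma = 0.9

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# For each sampled trajectory of 2-objective reward vectors, check that
# scalarizing the discounted return equals the discounted return of the
# scalarized rewards -- the identity behind R_w(s, a, s') = w . R(s, a, s').
for _ in range(100):
    traj = [(random.random(), random.random()) for _ in range(20)]
    ret = [sum(gamma**k * r[i] for k, r in enumerate(traj)) for i in range(2)]
    lhs = dot(w, ret)                                            # w . sum_k gamma^k r_k
    rhs = sum(gamma**k * dot(w, r) for k, r in enumerate(traj))  # sum_k gamma^k (w . r_k)
    assert abs(lhs - rhs) < 1e-9
```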
40 Problem Taxonomy
(taxonomy table as on slide 34)
Example: collecting bottles and cans
41 Problem Taxonomy
(taxonomy table as on slide 34)
Example: collecting bottles and cans
Note: the only cell in the taxonomy that does not require multi-objective methods
42 Problem Taxonomy
(taxonomy table as on slide 34)
43 Multiple Policies
Unknown weights or decision support: multiple policies
During planning, $\mathbf{w}$ is unknown
The size of the solution set is generally > 1
The set should not contain policies that are suboptimal for all $\mathbf{w}$
44 Undominated & Coverage Sets
Definition: The undominated set $U(\Pi)$ is the subset of all possible policies $\Pi$ for which there exists a $\mathbf{w}$ for which the scalarized value is maximal:
$U(\Pi) = \{\pi : \pi \in \Pi \wedge \exists \mathbf{w}\, \forall \pi' \in \Pi,\; V^\pi_w \geq V^{\pi'}_w\}$
45 Undominated & Coverage Sets
Definition: The undominated set $U(\Pi)$ is the subset of all possible policies $\Pi$ for which there exists a $\mathbf{w}$ for which the scalarized value is maximal:
$U(\Pi) = \{\pi : \pi \in \Pi \wedge \exists \mathbf{w}\, \forall \pi' \in \Pi,\; V^\pi_w \geq V^{\pi'}_w\}$
Definition: A coverage set $CS(\Pi)$ is a subset of $U(\Pi)$ that, for every $\mathbf{w}$, contains a policy with maximal scalarized value, i.e.,
$CS(\Pi) \subseteq U(\Pi) \wedge (\forall \mathbf{w})(\exists \pi)\big(\pi \in CS(\Pi) \wedge \forall \pi' \in \Pi,\; V^\pi_w \geq V^{\pi'}_w\big)$
46 Example
(table: $V^\pi_w$ for $w$ = true and $w$ = false across policies $\pi_1, \ldots, \pi_4$; values lost in transcription)
One binary weight feature: only two possible weights
The weights are not objectives but two possible scalarizations
47 Example
(table: $V^\pi_w$ for $w$ = true and $w$ = false across policies $\pi_1, \ldots, \pi_4$; values lost in transcription)
One binary weight feature: only two possible weights
The weights are not objectives but two possible scalarizations
$U(\Pi) = \{\pi_1, \pi_2, \pi_3\}$ but $CS(\Pi) = \{\pi_1, \pi_2\}$ or $\{\pi_2, \pi_3\}$
48 Execution Phase
A single policy is selected from $CS(\Pi)$ and executed
Unknown weights: the weights are revealed and the maximizing policy is selected: $\pi^* = \arg\max_{\pi \in CS(\Pi)} V^\pi_w$
Decision support: $CS(\Pi)$ is manually inspected by the user
49 Linear Scalarization & Multiple Policies
Definition: The convex hull $CH(\Pi)$ is the subset of $\Pi$ for which there exists a $\mathbf{w}$ that maximizes the linearly scalarized value:
$CH(\Pi) = \{\pi : \pi \in \Pi \wedge \exists \mathbf{w}\, \forall \pi' \in \Pi,\; \mathbf{w} \cdot \mathbf{V}^\pi \geq \mathbf{w} \cdot \mathbf{V}^{\pi'}\}$
50 Linear Scalarization & Multiple Policies
Definition: The convex hull $CH(\Pi)$ is the subset of $\Pi$ for which there exists a $\mathbf{w}$ that maximizes the linearly scalarized value:
$CH(\Pi) = \{\pi : \pi \in \Pi \wedge \exists \mathbf{w}\, \forall \pi' \in \Pi,\; \mathbf{w} \cdot \mathbf{V}^\pi \geq \mathbf{w} \cdot \mathbf{V}^{\pi'}\}$
Definition: The convex coverage set $CCS(\Pi)$ is a subset of $CH(\Pi)$ that, for every $\mathbf{w}$, contains a policy whose linearly scalarized value is maximal, i.e.,
$CCS(\Pi) \subseteq CH(\Pi) \wedge (\forall \mathbf{w})(\exists \pi)\big(\pi \in CCS(\Pi) \wedge \forall \pi' \in \Pi,\; \mathbf{w} \cdot \mathbf{V}^\pi \geq \mathbf{w} \cdot \mathbf{V}^{\pi'}\big)$
51 Visualization
(figure: objective space vs. weight space; $V_w = w_0 V_0 + w_1 V_1$ with $w_0 = 1 - w_1$)
52 Problem Taxonomy
(taxonomy table as on slide 34)
Example: mining gold and silver
53 Problem Taxonomy
(taxonomy table as on slide 34)
54 Monotonically Increasing Scalarization Functions
Mining example: $\mathbf{V}^{\pi_1} = (3, 0)$, $\mathbf{V}^{\pi_2} = (0, 3)$, $\mathbf{V}^{\pi_3} = (1, 1)$
Choosing $\pi_3$ implies a nonlinear scalarization function
55 Monotonically Increasing Scalarization Functions
Definition: A scalarization function is strictly monotonically increasing if changing a policy such that its value increases in one or more objectives, without decreasing in any other objective, also increases the scalarized value:
$\big(\forall i\; V^\pi_i \geq V^{\pi'}_i \wedge \exists i\; V^\pi_i > V^{\pi'}_i\big) \Rightarrow \big(\forall \mathbf{w}\; V^\pi_w > V^{\pi'}_w\big)$
56 Monotonically Increasing Scalarization Functions
Definition: A scalarization function is strictly monotonically increasing if changing a policy such that its value increases in one or more objectives, without decreasing in any other objective, also increases the scalarized value:
$\big(\forall i\; V^\pi_i \geq V^{\pi'}_i \wedge \exists i\; V^\pi_i > V^{\pi'}_i\big) \Rightarrow \big(\forall \mathbf{w}\; V^\pi_w > V^{\pi'}_w\big)$
Definition: A policy $\pi$ Pareto-dominates another policy $\pi'$ when its value is at least as high in all objectives and strictly higher in at least one objective:
$\mathbf{V}^\pi \succ_P \mathbf{V}^{\pi'} \Leftrightarrow \forall i\; V^\pi_i \geq V^{\pi'}_i \wedge \exists i\; V^\pi_i > V^{\pi'}_i$
A policy is Pareto optimal if no policy Pareto-dominates it.
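Pareto dominance translates directly into code. A minimal sketch (the function names are our own) that applies the definition above and filters a set of value vectors down to its Pareto front:

```python
def pareto_dominates(v, u):
    """True iff value vector v Pareto-dominates u: at least as good in
    every objective and strictly better in at least one."""
    return (all(vi >= ui for vi, ui in zip(v, u))
            and any(vi > ui for vi, ui in zip(v, u)))

def pareto_front(values):
    """Keep the vectors not Pareto-dominated by any other vector."""
    return [v for v in values
            if not any(pareto_dominates(u, v) for u in values)]

# The mining example from the slides: all three policies are Pareto-optimal.
vals = [(3, 0), (0, 3), (1, 1)]
print(pareto_front(vals))  # [(3, 0), (0, 3), (1, 1)]
```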
57 Nonlinear Scalarization Can Destroy Additivity
Nonlinear scalarization and expectation do not commute:
$V^\pi_w = f(\mathbf{V}^\pi, \mathbf{w}) = f(\mathbb{E}[\sum_{k=0}^\infty \gamma^k \mathbf{r}_{t+k+1}], \mathbf{w}) \neq \mathbb{E}[\sum_{k=0}^\infty \gamma^k f(\mathbf{r}_{t+k+1}, \mathbf{w})]$
Bellman-based methods are not applicable
Local action selection no longer yields an optimal policy:
$\pi^*(s) \neq \arg\max_a \sum_{s'} T(s, a, s')\,[\mathbf{R}(s, a, s') + \gamma \mathbf{V}^*(s')]$
58 Deterministic vs. Stochastic Policies
Stochastic policies are fine in most settings
Sometimes they are inappropriate, e.g., medical treatment
In MDPs, requiring determinism is not restrictive
The optimal value is attainable with a deterministic stationary policy:
$\pi^*(s) = \arg\max_a \sum_{s'} T(s, a, s')\,[R(s, a, s') + \gamma V^*(s')]$
59 Deterministic vs. Stochastic Policies
Stochastic policies are fine in most settings
Sometimes they are inappropriate, e.g., medical treatment
In MDPs, requiring determinism is not restrictive
The optimal value is attainable with a deterministic stationary policy:
$\pi^*(s) = \arg\max_a \sum_{s'} T(s, a, s')\,[R(s, a, s') + \gamma V^*(s')]$
Similar for MOMDPs with linear scalarization
For MOMDPs with nonlinear scalarization:
Stochastic policies may be preferable if allowed
Nonstationary policies may be preferable otherwise
60 White's Example (1982)
3 actions: $\mathbf{R}(a_1) = (3, 0)$, $\mathbf{R}(a_2) = (0, 3)$, $\mathbf{R}(a_3) = (1, 1)$
61 White's Example (1982)
3 actions: $\mathbf{R}(a_1) = (3, 0)$, $\mathbf{R}(a_2) = (0, 3)$, $\mathbf{R}(a_3) = (1, 1)$
3 stationary policies, all Pareto-optimal:
$\mathbf{V}^{\pi_1} = \big(\tfrac{3}{1-\gamma}, 0\big)$, $\mathbf{V}^{\pi_2} = \big(0, \tfrac{3}{1-\gamma}\big)$, $\mathbf{V}^{\pi_3} = \big(\tfrac{1}{1-\gamma}, \tfrac{1}{1-\gamma}\big)$
62 White's Example (1982)
3 actions: $\mathbf{R}(a_1) = (3, 0)$, $\mathbf{R}(a_2) = (0, 3)$, $\mathbf{R}(a_3) = (1, 1)$
3 stationary policies, all Pareto-optimal:
$\mathbf{V}^{\pi_1} = \big(\tfrac{3}{1-\gamma}, 0\big)$, $\mathbf{V}^{\pi_2} = \big(0, \tfrac{3}{1-\gamma}\big)$, $\mathbf{V}^{\pi_3} = \big(\tfrac{1}{1-\gamma}, \tfrac{1}{1-\gamma}\big)$
$\pi_{ns}$ alternates between $a_1$ and $a_2$, starting with $a_1$:
$\mathbf{V}^{\pi_{ns}} = \big(\tfrac{3}{1-\gamma^2}, \tfrac{3\gamma}{1-\gamma^2}\big)$
63 White's Example (1982)
3 actions: $\mathbf{R}(a_1) = (3, 0)$, $\mathbf{R}(a_2) = (0, 3)$, $\mathbf{R}(a_3) = (1, 1)$
3 stationary policies, all Pareto-optimal:
$\mathbf{V}^{\pi_1} = \big(\tfrac{3}{1-\gamma}, 0\big)$, $\mathbf{V}^{\pi_2} = \big(0, \tfrac{3}{1-\gamma}\big)$, $\mathbf{V}^{\pi_3} = \big(\tfrac{1}{1-\gamma}, \tfrac{1}{1-\gamma}\big)$
$\pi_{ns}$ alternates between $a_1$ and $a_2$, starting with $a_1$:
$\mathbf{V}^{\pi_{ns}} = \big(\tfrac{3}{1-\gamma^2}, \tfrac{3\gamma}{1-\gamma^2}\big)$
Thus $\pi_{ns} \succ_P \pi_3$ when $\gamma \geq 0.5$; e.g., for $\gamma = 0.5$ and $f(\mathbf{V}^\pi) = V^\pi_1 \cdot V^\pi_2$:
$f(\mathbf{V}^{\pi_1}) = f(\mathbf{V}^{\pi_2}) = 0$, $f(\mathbf{V}^{\pi_3}) = 4$, $f(\mathbf{V}^{\pi_{ns}}) = 8$
(figure: value vectors for $\gamma$ = 0.3, 0.5, 0.7, 0.9)
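The arithmetic in White's example can be checked directly. A small sketch (policy names follow the slide; the dominance check repeats the Pareto definition):

```python
def values(gamma):
    """Value vectors of the three stationary policies and the
    alternating nonstationary policy in White's example."""
    v1 = (3 / (1 - gamma), 0.0)
    v2 = (0.0, 3 / (1 - gamma))
    v3 = (1 / (1 - gamma), 1 / (1 - gamma))
    vns = (3 / (1 - gamma**2), 3 * gamma / (1 - gamma**2))
    return v1, v2, v3, vns

def dominates(v, u):
    return (all(a >= b for a, b in zip(v, u))
            and any(a > b for a, b in zip(v, u)))

v1, v2, v3, vns = values(0.5)
print(v3, vns)                      # (2.0, 2.0) (4.0, 2.0)
assert dominates(vns, v3)           # pi_ns Pareto-dominates pi_3 at gamma = 0.5

f = lambda v: v[0] * v[1]           # the product scalarization from the slide
print(f(v1), f(v2), f(v3), f(vns))  # 0.0 0.0 4.0 8.0
```

For gamma below 0.5 the dominance disappears, matching the condition on the slide.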
64 Problem Taxonomy
(taxonomy table as on slide 34)
Example: radiation vs. chemotherapy
65 Problem Taxonomy
(taxonomy table as on slide 34)
66 Mixture Policies
A mixture policy $\pi_m$ selects the $i$-th policy from a set of $N$ policies with probability $p_i$, where $\sum_{i=1}^N p_i = 1$
Its value is a convex combination of the values of the constituent policies
In White's example, replace $\pi_{ns}$ by $\pi_m$:
$\mathbf{V}^{\pi_m} = p_1 \mathbf{V}^{\pi_1} + (1 - p_1)\mathbf{V}^{\pi_2} = \big(\tfrac{3 p_1}{1-\gamma}, \tfrac{3(1-p_1)}{1-\gamma}\big)$
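The convex-combination formula can be sketched in a few lines (using the White's-example values at gamma = 0.5; the function name is our own):

```python
gamma = 0.5
v1 = (3 / (1 - gamma), 0.0)   # V^{pi_1} = (6, 0)
v2 = (0.0, 3 / (1 - gamma))   # V^{pi_2} = (0, 6)

def mixture_value(p1):
    """Value of the mixture policy that runs pi_1 with probability p1
    and pi_2 with probability 1 - p1: a convex combination of values."""
    return tuple(p1 * a + (1 - p1) * b for a, b in zip(v1, v2))

print(mixture_value(0.5))  # (3.0, 3.0), i.e. (3 p1/(1-gamma), 3(1-p1)/(1-gamma))
```

Sweeping p1 from 0 to 1 traces the line segment between the two constituent value vectors in objective space.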
67 Problem Taxonomy
(taxonomy table as on slide 34)
Example: studying vs. networking
68 Problem Taxonomy
(taxonomy table as on slide 34)
69 Pareto Sets
Definition: The Pareto front is the set of all policies that are not Pareto-dominated:
$PF(\Pi) = \{\pi : \pi \in \Pi \wedge \neg\exists \pi' \in \Pi,\; \mathbf{V}^{\pi'} \succ_P \mathbf{V}^\pi\}$
70 Pareto Sets
Definition: The Pareto front is the set of all policies that are not Pareto-dominated:
$PF(\Pi) = \{\pi : \pi \in \Pi \wedge \neg\exists \pi' \in \Pi,\; \mathbf{V}^{\pi'} \succ_P \mathbf{V}^\pi\}$
Definition: A Pareto coverage set is a subset of $PF(\Pi)$ such that, for every $\pi' \in \Pi$, it contains a policy that either dominates $\pi'$ or has equal value to $\pi'$:
$PCS(\Pi) \subseteq PF(\Pi) \wedge (\forall \pi' \in \Pi)(\exists \pi)\big(\pi \in PCS(\Pi) \wedge (\mathbf{V}^\pi \succ_P \mathbf{V}^{\pi'} \vee \mathbf{V}^\pi = \mathbf{V}^{\pi'})\big)$
71 Visualization
(figure: objective space vs. weight space)
72 Visualization
(figure: objective space vs. weight space)
73 Problem Taxonomy
(taxonomy table as on slide 34)
Example: radiation vs. chemotherapy (again)
74 Problem Taxonomy
(taxonomy table as on slide 34)
Example: radiation vs. chemotherapy (again)
Note: the only setting that requires a Pareto front!
75 Problem Taxonomy
(taxonomy table as on slide 34)
76 Mixture Policies
A $CCS(\Pi_{DS})$ is also a $CCS(\Pi)$, but not necessarily a $PCS(\Pi)$
But a $PCS(\Pi)$ can be made by mixing policies in a $CCS(\Pi_{DS})$
(figure: objective space)
77 Problem Taxonomy
(taxonomy table as on slide 34)
Example: studying vs. networking (again)
78 Part 2: Methods and Applications
Convex Coverage Set Planning Methods
  Inner Loop: Convex Hull Value Iteration
  Outer Loop: Optimistic Linear Support
Pareto Coverage Set Planning Methods
  Inner loop (non-stationary): Pareto-Q
  Outer loop issues
Beyond the Taxonomy
Applications
79 Taxonomy
(taxonomy table as on slide 34)
80 Taxonomy
(taxonomy table as on slide 34)
Known transition and reward functions: planning
81 Taxonomy
(taxonomy table as on slide 34)
Known transition and reward functions: planning
Unknown transition and reward functions: learning
82 Background: Value Iteration
Initialize the value estimate $V_0(s)$
Apply Bellman backups until convergence:
$V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s')\,\big[R(s, a, s') + \gamma V_k(s')\big]$
83 Background: Value Iteration
Initialize the value estimate $V_0(s)$
Apply Bellman backups until convergence:
$V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s')\,\big[R(s, a, s') + \gamma V_k(s')\big]$
Can also be written:
$V_{k+1}(s) \leftarrow \max_a Q_{k+1}(s, a)$
$Q_{k+1}(s, a) \leftarrow \sum_{s'} T(s, a, s')\,\big[R(s, a, s') + \gamma V_k(s')\big]$
The optimal policy is easy to retrieve from the Q-table
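The backup above can be sketched as standard value iteration over tabular dictionaries (the data layout is our own assumption, matching nothing specific in the slides):

```python
def value_iteration(S, A, T, R, gamma, eps=1e-8):
    """Tabular value iteration. T[(s, a)] maps s' -> probability,
    R[(s, a, s')] is the scalar reward. Returns (V, Q) at convergence."""
    V = {s: 0.0 for s in S}
    while True:
        # Q_{k+1}(s, a) = sum_{s'} T(s, a, s') [R(s, a, s') + gamma V_k(s')]
        Q = {(s, a): sum(p * (R[(s, a, s2)] + gamma * V[s2])
                         for s2, p in T[(s, a)].items())
             for s in S for a in A if (s, a) in T}
        # V_{k+1}(s) = max_a Q_{k+1}(s, a)
        V_new = {s: max(Q[(s, a)] for a in A if (s, a) in Q) for s in S}
        if max(abs(V_new[s] - V[s]) for s in S) < eps:
            return V_new, Q
        V = V_new

# Tiny sanity check: one state, one action that pays 1 forever.
S, A = ["s0"], ["go"]
T = {("s0", "go"): {"s0": 1.0}}
R = {("s0", "go", "s0"): 1.0}
V, Q = value_iteration(S, A, T, R, gamma=0.5)
print(round(V["s0"], 6))  # 2.0, i.e. 1/(1 - gamma)
```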
84 Taxonomy
(taxonomy table as on slide 34)
85 Scalarize MOMDP + Value Iteration
For known $\mathbf{w}$:
$V^\pi_w = \mathbf{w} \cdot \mathbb{E}[\sum_{k=0}^\infty \gamma^k \mathbf{r}_{t+k+1}] = \mathbb{E}[\sum_{k=0}^\infty \gamma^k (\mathbf{w} \cdot \mathbf{r}_{t+k+1})]$
86 Scalarize MOMDP + Value Iteration
For known $\mathbf{w}$:
$V^\pi_w = \mathbf{w} \cdot \mathbb{E}[\sum_{k=0}^\infty \gamma^k \mathbf{r}_{t+k+1}] = \mathbb{E}[\sum_{k=0}^\infty \gamma^k (\mathbf{w} \cdot \mathbf{r}_{t+k+1})]$
Scalarize the reward function of the MOMDP: $R_w = \mathbf{w} \cdot \mathbf{R}$
87 Scalarize MOMDP + Value Iteration
For known $\mathbf{w}$:
$V^\pi_w = \mathbf{w} \cdot \mathbb{E}[\sum_{k=0}^\infty \gamma^k \mathbf{r}_{t+k+1}] = \mathbb{E}[\sum_{k=0}^\infty \gamma^k (\mathbf{w} \cdot \mathbf{r}_{t+k+1})]$
Scalarize the reward function of the MOMDP: $R_w = \mathbf{w} \cdot \mathbf{R}$
Apply standard VI
88 Scalarize MOMDP + Value Iteration
For known $\mathbf{w}$:
$V^\pi_w = \mathbf{w} \cdot \mathbb{E}[\sum_{k=0}^\infty \gamma^k \mathbf{r}_{t+k+1}] = \mathbb{E}[\sum_{k=0}^\infty \gamma^k (\mathbf{w} \cdot \mathbf{r}_{t+k+1})]$
Scalarize the reward function of the MOMDP: $R_w = \mathbf{w} \cdot \mathbf{R}$
Apply standard VI
Does not return the multi-objective value
89 Scalarized Value Iteration
Adapt the Bellman backup:
$\mathbf{V}_{k+1}(s) \leftarrow \mathbf{Q}_{k+1}(s, a^*)$ where $a^* = \arg\max_a \mathbf{w} \cdot \mathbf{Q}_{k+1}(s, a)$
$\mathbf{Q}_{k+1}(s, a) \leftarrow \sum_{s'} T(s, a, s')\,\big[\mathbf{R}(s, a, s') + \gamma \mathbf{V}_k(s')\big]$
Returns the multi-objective value.
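A sketch of this adapted backup: the action is still chosen greedily with respect to the scalarized value w . Q, but the stored value stays a vector (data layout and function name are our own assumptions):

```python
def scalarized_vi(S, A, T, R, w, gamma, iters=200):
    """Scalarized value iteration: greedy w.r.t. w . Q, but keep the
    full multi-objective value vector (sketch of the adapted backup)."""
    n = len(w)
    V = {s: (0.0,) * n for s in S}
    for _ in range(iters):
        Q = {}
        for (s, a), succ in T.items():
            q = [0.0] * n
            for s2, p in succ.items():
                for i in range(n):
                    q[i] += p * (R[(s, a, s2)][i] + gamma * V[s2][i])
            Q[(s, a)] = tuple(q)
        # V_{k+1}(s) <- Q_{k+1}(s, a*) with a* = argmax_a w . Q_{k+1}(s, a)
        V = {s: max((Q[(s, a)] for a in A if (s, a) in Q),
                    key=lambda q: sum(wi * qi for wi, qi in zip(w, q)))
             for s in S}
    return V

# One-state MOMDP with vector rewards (2, 0) and (0, 2); w prefers objective 1,
# so a1 is always chosen, yet the full value vector is returned.
S, A = ["s"], ["a1", "a2"]
T = {("s", "a1"): {"s": 1.0}, ("s", "a2"): {"s": 1.0}}
R = {("s", "a1", "s"): (2.0, 0.0), ("s", "a2", "s"): (0.0, 2.0)}
V = scalarized_vi(S, A, T, R, w=(0.9, 0.1), gamma=0.5)
print(V["s"])  # approximately (4.0, 0.0)
```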
90 Taxonomy
(taxonomy table as on slide 34)
91 Inner versus Outer Loop
(diagram: single-objective method vs. multi-objective inner loop vs. multi-objective outer loop)
Inner loop:
Adapt the operators of a single-objective method (e.g., value iteration)
A series of multi-objective operations (e.g., Bellman backups)
92 Inner versus Outer Loop
(diagram: single-objective method vs. multi-objective inner loop vs. multi-objective outer loop)
Inner loop:
Adapt the operators of a single-objective method (e.g., value iteration)
A series of multi-objective operations (e.g., Bellman backups)
Outer loop:
Use a single-objective method as a subroutine
A series of single-objective problems
93 Inner Loop: Convex Hull Value Iteration
Barrett & Narayanan (2008)
Idea: do the backup for all $\mathbf{w}$ in parallel
The new backup operators must handle sets of value vectors.
At each backup: generate all value vectors for each $(s, a)$-pair, then prune away those that are not optimal for any $\mathbf{w}$
Only deterministic stationary policies are needed
94 Inner Loop: Convex Hull Value Iteration
Initial set of value vectors, e.g., $V_0(s) = \{(0, 0)\}$
All possible value vectors:
$Q_{k+1}(s, a) \leftarrow \bigoplus_{s'} T(s, a, s')\,\big[\mathbf{R}(s, a, s') + \gamma V_k(s')\big]$
where $\mathbf{u} + V = \{\mathbf{u} + \mathbf{v} : \mathbf{v} \in V\}$ and $U \oplus V = \{\mathbf{u} + \mathbf{v} : \mathbf{u} \in U \wedge \mathbf{v} \in V\}$
95 Inner Loop: Convex Hull Value Iteration
Initial set of value vectors, e.g., $V_0(s) = \{(0, 0)\}$
All possible value vectors:
$Q_{k+1}(s, a) \leftarrow \bigoplus_{s'} T(s, a, s')\,\big[\mathbf{R}(s, a, s') + \gamma V_k(s')\big]$
where $\mathbf{u} + V = \{\mathbf{u} + \mathbf{v} : \mathbf{v} \in V\}$ and $U \oplus V = \{\mathbf{u} + \mathbf{v} : \mathbf{u} \in U \wedge \mathbf{v} \in V\}$
Prune the value vectors:
$V_{k+1}(s) \leftarrow \mathrm{CPrune}\big(\bigcup_a Q_{k+1}(s, a)\big)$
CPrune uses linear programs (e.g., Roijers et al. (2015))
96 CHVI Example Extremely simple MOMDP: 1 state: s; 2 actions: a_1 and a_2 Deterministic transitions Deterministic rewards: R(s, a_1, s) = (2, 0) R(s, a_2, s) = (0, 2) γ = 0.5 V_0(s) = {(0, 0)} [plot: V_w over w_1] Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
97 CHVI Example Deterministic rewards: R(s, a_1, s) = (2, 0) R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 1: V_0(s) = {(0, 0)} Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
98 CHVI Example Deterministic rewards: R(s, a_1, s) = (2, 0) R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 1: V_0(s) = {(0, 0)} Q_1(s, a_1) = {(2, 0)} Q_1(s, a_2) = {(0, 2)} Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
99 CHVI Example Deterministic rewards: R(s, a_1, s) = (2, 0) R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 1: V_0(s) = {(0, 0)} Q_1(s, a_1) = {(2, 0)} Q_1(s, a_2) = {(0, 2)} V_1(s) = CPrune( ∪_a Q_1(s, a) ) = {(2, 0), (0, 2)} [plot: V_w over w_1] Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
100 CHVI Example Deterministic rewards: R(s, a_1, s) = (2, 0) R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 2: V_1(s) = {(2, 0), (0, 2)} Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
101 CHVI Example Deterministic rewards: R(s, a_1, s) = (2, 0) R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 2: V_1(s) = {(2, 0), (0, 2)} Q_2(s, a_1) = {(3, 0), (2, 1)} Q_2(s, a_2) = {(1, 2), (0, 3)} Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
102 CHVI Example Deterministic rewards: R(s, a_1, s) = (2, 0) R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 2: V_1(s) = {(2, 0), (0, 2)} Q_2(s, a_1) = {(3, 0), (2, 1)} Q_2(s, a_2) = {(1, 2), (0, 3)} V_2(s) = CPrune({(3, 0), (2, 1), (1, 2), (0, 3)}) [plot: V_w over w_1] Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
103 CHVI Example Deterministic rewards: R(s, a_1, s) = (2, 0) R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 2: V_1(s) = {(2, 0), (0, 2)} Q_2(s, a_1) = {(3, 0), (2, 1)} Q_2(s, a_2) = {(1, 2), (0, 3)} V_2(s) = {(3, 0), (0, 3)} [plot: V_w over w_1] Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
104 CHVI Example Deterministic rewards: R(s, a_1, s) = (2, 0) R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 3: V_2(s) = {(3, 0), (0, 3)} Q_3(s, a_1) = {(3.5, 0), (2, 1.5)} Q_3(s, a_2) = {(1.5, 2), (0, 3.5)} V_3(s) = CPrune({(3.5, 0), (2, 1.5), (1.5, 2), (0, 3.5)}) = {(3.5, 0), (0, 3.5)} [plot: V_w over w_1] Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
105 Convex Hull Value Iteration CPrune retains at least one optimal vector for each w Therefore, the V_w that would have been computed by VI is kept CHVI does not retain excess value vectors Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
106 Convex Hull Value Iteration CPrune retains at least one optimal vector for each w Therefore, the V_w that would have been computed by VI is kept CHVI does not retain excess value vectors CHVI generates a lot of excess value vectors Removal with linear programs (CPrune) is expensive Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
107 Outer Loop [diagram: single-objective method vs. MO inner loop vs. MO outer loop] Repeatedly calls a single-objective solver Generic multi-objective method: multi-objective coordination graphs, multi-objective (multi-agent) MDPs, multi-objective partially observable MDPs Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
108 Outer Loop: Optimistic Linear Support Optimistic linear support (OLS) adapts and improves linear support for POMDPs (Cheng (1988)) Solves scalarized instances for specific w Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
109 Outer Loop: Optimistic Linear Support Optimistic linear support (OLS) adapts and improves linear support for POMDPs (Cheng (1988)) Solves scalarized instances for specific w Terminates after checking only a finite number of weights Returns exact CCS Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
110 Linear Support [sequence of plots: the scalarized value V_w over w_1, with the upper surface of the value vectors built up step by step] Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
114 Optimistic Linear Support [plots: scalarized value u · w over w_1, with value vectors (1, 8), (5, 6), and (7, 2); at corner weight w_c the maximal possible improvement is Δ] Priority queue, Q, for corner weights Maximal possible improvement Δ as priority Stop when Δ < ε Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
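A simplified sketch of the OLS loop for two objectives (hypothetical code, not the tutorial's implementation): it uses a plain worklist instead of the priority queue, and assumes `solve(w)` is an exact single-objective subroutine returning the value vector that maximizes w · v.

```python
def scalarize(w, v):
    return sum(wi * vi for wi, vi in zip(w, v))

def corner_weight(u, v):
    # Weight (w1, 1 - w1) where the scalarized values of vectors u and v intersect.
    denom = (u[0] - v[0]) - (u[1] - v[1])
    if abs(denom) < 1e-12:
        return None
    w1 = (v[1] - u[1]) / denom
    return (w1, 1.0 - w1) if 0.0 < w1 < 1.0 else None

def ols_2d(solve, eps=1e-9):
    # Solve the extrema first, then probe the corner weights of the current surface.
    ccs = []
    for w in [(1.0, 0.0), (0.0, 1.0)]:
        v = solve(w)
        if v not in ccs:
            ccs.append(v)
    todo = list(zip(ccs, ccs[1:]))
    while todo:
        u, v = todo.pop()
        w = corner_weight(u, v)
        if w is None:
            continue
        new = solve(w)
        # Keep `new` only if it improves the scalarized value at this corner weight.
        if new not in ccs and scalarize(w, new) > scalarize(w, u) + eps:
            ccs.append(new)
            todo += [(u, new), (new, v)]
    return ccs

# The single-objective "solver" here just scalarizes a fixed candidate set.
candidates = [(1, 8), (5, 6), (7, 2), (4, 4)]
ccs = ols_2d(lambda w: max(candidates, key=lambda v: scalarize(w, v)))
```

With the slide's vectors (1, 8), (5, 6), (7, 2) plus a dominated (4, 4), the loop terminates after solving only a handful of scalarized instances; (4, 4) never appears because no corner weight selects it.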
115 Optimistic Linear Support Solving the scalarized instance exactly is not always possible With an ε-approximate solver, OLS produces an ε-CCS Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
116 Reusing Value Functions Observation: when the weights w are close, so are the optimal value functions V Hot-start with V(s) from a close w Optimistic Linear Support with Alpha Reuse (OLSAR) [plot: ε versus runtime (ms) for RS, RAR, OLS+, and OLSAR] Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
117 Comparing Inner and Outer Loop OLS (outer loop) advantages Any (cooperative) multi-objective decision problem Any single-objective / scalarized subroutine Inherits quality guarantees Faster for small and medium numbers of objectives Inner loop faster for large numbers of objectives Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
118 Taxonomy
single policy (known weights) / multiple policies (unknown weights or decision support):
linear scalarization: one stationary policy / convex coverage set of stationary policies
monotonically increasing scalarization, deterministic policies: one nonstationary policy / Pareto coverage set of nonstationary policies
monotonically increasing scalarization, stochastic policies: one mixture policy of two or more stationary policies / convex coverage set of stationary policies
Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
119 Inner Loop: Pareto-Q Similar to CHVI Different pruning operator Pairwise comparisons: V(s) ≻_P V'(s) Comparisons are cheaper, but there are many more vectors Converges to the correct Pareto coverage set (White (1982)) Executing a policy is no longer trivial (Van Moffaert & Nowé (2014)) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
120 Inner Loop: Pareto-Q Compute all possible vectors: Q_{k+1}(s, a) ← ⊕_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ] where u + V = {u + v : v ∈ V}, U ⊕ V = {u + v : u ∈ U ∧ v ∈ V} Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
121 Inner Loop: Pareto-Q Compute all possible vectors: Q_{k+1}(s, a) ← ⊕_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ] where u + V = {u + v : v ∈ V}, U ⊕ V = {u + v : u ∈ U ∧ v ∈ V} Take the union across a Prune Pareto-dominated vectors: V_{k+1}(s) ← PPrune( ∪_a Q_{k+1}(s, a) ) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
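PPrune can be sketched with plain pairwise dominance checks, no linear programs needed (an illustrative minimal version, not the authors' code); this is why each comparison is cheap even though the retained sets grow large.

```python
def dominates(u, v):
    # u Pareto-dominates v: at least as good in every objective, better in at least one.
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def p_prune(vectors):
    # Keep exactly the vectors not Pareto-dominated by any other vector in the set.
    vectors = set(vectors)
    return {v for v in vectors if not any(dominates(u, v) for u in vectors)}

pcs = p_prune({(3, 0), (2, 1), (1, 2), (0, 3), (1, 1)})  # (1, 1) is dominated by (2, 1)
```

Note that, unlike CPrune, this keeps the full Pareto front: (2, 1) and (1, 2) survive here even though no linear weight would ever select them.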
122 Pareto-Q Example Extremely simple MOMDP: 1 state: s; 2 actions: a_1 and a_2 Deterministic rewards: R(s, a_1, s) = (2, 0) R(s, a_2, s) = (0, 2) γ = 0.5 V_0(s) = {(0, 0)} [plot: value vectors in objective space] Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
123 Pareto-Q Example Deterministic rewards: R(s, a_1, s) = (2, 0) R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 1: V_0(s) = {(0, 0)} Q_1(s, a_1) = {(2, 0)} Q_1(s, a_2) = {(0, 2)} V_1(s) = PPrune( ∪_a Q_1(s, a) ) = {(2, 0), (0, 2)} [plot: value vectors in objective space] Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
124 Pareto-Q Example Deterministic rewards: R(s, a_1, s) = (2, 0) R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 2: V_1(s) = {(2, 0), (0, 2)} Q_2(s, a_1) = {(3, 0), (2, 1)} Q_2(s, a_2) = {(1, 2), (0, 3)} V_2(s) = PPrune({(3, 0), (2, 1), (1, 2), (0, 3)}) [plot: value vectors in objective space] Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
125 Pareto-Q Example Deterministic rewards: R(s, a_1, s) = (2, 0) R(s, a_2, s) = (0, 2) γ = 0.5 Iteration 3: V_2(s) = {(3, 0), (2, 1), (1, 2), (0, 3)} Q_3(s, a_1) = {(3.5, 0), (3, 0.5), (2.5, 1), (2, 1.5)} Q_3(s, a_2) = {(1.5, 2), (1, 2.5), (0.5, 3), (0, 3.5)} V_3(s) = PPrune({(3.5, 0), (3, 0.5), (2.5, 1), (2, 1.5), (1.5, 2), (1, 2.5), (0.5, 3), (0, 3.5)}) [plot: value vectors in objective space] Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
126 Inner Loop: Pareto-Q PCS size can explode Can no longer read a policy from the Q-table, except for the first action Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
127 Inner Loop: Pareto-Q PCS size can explode Can no longer read a policy from the Q-table, except for the first action Track a policy during execution (Van Moffaert & Nowé (2014)): for a transition s, a → s', subtract R(s, a) from Q_{t=0}(s, a), correct for the discount factor to obtain V_{t=1}(s'), and find V_{t=1}(s') in the Q-tables for s' For stochastic transitions, see Kristof Van Moffaert's PhD thesis Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
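For deterministic transitions, the tracking step above can be sketched as follows (function names are illustrative, not from the cited work): subtract the reward, undo the discount, and look the resulting vector up in the next state's Q-sets.

```python
def next_value_target(q_vec, r_vec, gamma):
    # Invert one deterministic backup: V_{t=1}(s') = (Q_{t=0}(s, a) - R(s, a)) / gamma
    return tuple((q - r) / gamma for q, r in zip(q_vec, r_vec))

def find_next_action(q_sets, target, tol=1e-9):
    # q_sets maps each action at s' to its set of value vectors; return the action
    # whose Q-set contains the target vector (up to numerical tolerance).
    for a, vecs in q_sets.items():
        for v in vecs:
            if all(abs(x - y) <= tol for x, y in zip(v, target)):
                return a
    return None

# Following the chosen vector (3, 0) after taking an action with R(s, a) = (2, 0):
target = next_value_target((3.0, 0.0), (2.0, 0.0), 0.5)  # (2.0, 0.0)
action = find_next_action({'a1': {(2.0, 0.0)}, 'a2': {(0.0, 2.0)}}, target)
```

In the running example this recovers that, to realize the vector (3, 0), the agent must keep taking a_1.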
128 Outer Loop? [diagram: single-objective method vs. MO inner loop vs. MO outer loop] Outer loop very difficult: V_w^π = f( E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ], w ) ≠ E[ Σ_{k=0}^∞ γ^k f(r_{t+k+1}, w) ] Maximization does not do the trick! Heuristic with non-linear f (Van Moffaert, Drugan & Nowé (2013)) Not guaranteed to find the optimal policy, or to converge Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
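The inequality can be checked numerically; here f is taken to be the nonlinear scalarization "worst objective" (min), and the return is (2, 0) or (0, 2) with equal probability:

```python
returns = [(2.0, 0.0), (0.0, 2.0)]  # two equiprobable vector-valued returns
f = min                             # nonlinear scalarization: value of the worst objective

expected_return = tuple(sum(r[i] for r in returns) / len(returns) for i in range(2))
f_of_expectation = f(expected_return)                         # f(E[R]) = min(1, 1) = 1.0
expectation_of_f = sum(f(r) for r in returns) / len(returns)  # E[f(R)] = (0 + 0) / 2 = 0.0
assert f_of_expectation != expectation_of_f
```

Scalarizing the expected return says the worst objective is worth 1, while the expected scalarized return is 0: the two orders of applying f and E disagree, so a single-objective solver applied to f-scalarized rewards optimizes the wrong quantity.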
129 Taxonomy
single policy (known weights) / multiple policies (unknown weights or decision support):
linear scalarization: one stationary policy / convex coverage set of stationary policies
monotonically increasing scalarization, deterministic policies: one nonstationary policy / Pareto coverage set of nonstationary policies
monotonically increasing scalarization, stochastic policies: one mixture policy of two or more stationary policies / convex coverage set of stationary policies
Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
130 Part 2: Methods and Applications Convex Coverage Set Planning Methods Inner Loop: Convex Hull Value Iteration Outer Loop: Optimistic Linear Support Pareto Coverage Set Planning Methods Inner loop (non-stationary): Pareto-Q Outer loop issues Beyond the Taxonomy Applications Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
131 Non-linear f and stationary policies (π : S → A) Scalarized returns are not additive CON-MODP (Wiering & De Jong (2007)) Does multi-objective Bellman backups, like Pareto-Q Additional CONsistency operator to force stationarity Guaranteed to converge to the correct PCS Computationally expensive Heuristic method: Pareto Local Policy Search (Kooijman et al. (2015)) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
132 Learning Scenarios [diagram: learning phase → selection phase → execution phase; a dataset feeds the learning algorithm, which outputs a coverage set; scalarization with the revealed weights selects the policy to execute] Model-based learning (Wiering, Withagen & Drugan (2014)) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
133 Learning Methods Q-Learning schema: Q_new(s, a) ← (1 − α) Q_old(s, a) + α ( R(s, a, s') + γ V_old(s') ) CCS: Parallel Q-learning (Hiraoka, Yoshida & Mashima (2009)) Det. Non-stat. PCS: Pareto-Q Learning (Van Moffaert & Nowé (2014)) Det. Stat. PCS: schema not applicable Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
134 Learning Methods Q-Learning schema: Q_new(s, a) ← (1 − α) Q_old(s, a) + α ( R(s, a, s') + γ V_old(s') ) CCS: Parallel Q-learning (Hiraoka, Yoshida & Mashima (2009)) Det. Non-stat. PCS: Pareto-Q Learning (Van Moffaert & Nowé (2014)) Det. Stat. PCS: schema not applicable Model-based learning for Det. Stat. PCS (Wiering, Withagen & Drugan (2014)) Various evolutionary approaches (e.g., Handa (2009)) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
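Applied componentwise to vector-valued rewards, the schema above can be sketched as follows (illustrative code; for the set-valued V_old(s') of the CCS/PCS methods the update is applied per value vector):

```python
def q_update(q_old, r, v_next, alpha, gamma):
    # Q_new(s, a) = (1 - alpha) Q_old(s, a) + alpha (R(s, a, s') + gamma V_old(s')),
    # applied componentwise to each objective of the value vectors.
    return tuple((1 - alpha) * q + alpha * (ri + gamma * vi)
                 for q, ri, vi in zip(q_old, r, v_next))

# One update from Q_old = (0, 0) after observing reward (2, 0) and V_old(s') = (0, 0):
q_new = q_update((0.0, 0.0), (2.0, 0.0), (0.0, 0.0), alpha=0.5, gamma=0.5)  # (1.0, 0.0)
```

Each objective is updated with the same temporal-difference rule, which is why the schema carries over directly to the CCS case but not to deterministic stationary PCS policies.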
135 Learning Scenarios [diagram: the agent learns online from states, actions, and rewards; the learning algorithm maintains a partial coverage set, and a policy selector uses the current weights to pick the policy executed in the environment] Dynamic preferences (Natarajan & Tadepalli (2005)) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
136 Other Decision Problems Taxonomy applies to cooperative decision problems Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
137 Other Decision Problems Taxonomy applies to cooperative decision problems Multi-objective coordination graphs / constraint optimization problems (Rollón (2008); Roijers, Whiteson & Oliehoek (2015a); Wilson, Razak & Marinescu (2015)) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
138 Other Decision Problems Taxonomy applies to cooperative decision problems Multi-objective coordination graphs / constraint optimization problems (Rollón (2008); Roijers, Whiteson & Oliehoek (2015a); Wilson, Razak & Marinescu (2015)) Multi-objective POMDPs (Soh & Demiris (2011); Roijers, Whiteson & Oliehoek (2015b); Wray & Zilberstein (2015)) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
139 Other Decision Problems Taxonomy applies to cooperative decision problems Multi-objective coordination graphs / constraint optimization problems (Rollón (2008); Roijers, Whiteson & Oliehoek (2015a); Wilson, Razak & Marinescu (2015)) Multi-objective POMDPs (Soh & Demiris (2011); Roijers, Whiteson & Oliehoek (2015b); Wray & Zilberstein (2015)) Multi-objective bandit problems (Drugan & Nowé (2013)) Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
140 Part 2: Methods and Applications Convex Coverage Set Planning Methods Inner Loop: Convex Hull Value Iteration Outer Loop: Optimistic Linear Support Pareto Coverage Set Planning Methods Inner loop (non-stationary): Pareto-Q Outer loop issues Beyond the Taxonomy Applications Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
141 Medical Applications Treatment planning (Lizotte (2010, 2012)) Maximizing effectiveness of the treatment Minimizing the severity of the side-effects Finite-horizon MOMDPs Deterministic Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
142 Medical Applications Anthrax response (Soh & Demiris (2011)) Minimizing loss of life Minimizing number of false alarms Minimizing cost of investigation Partial observability (MOPOMDP) Finite-state controllers Evolutionary method Pareto coverage set Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
143 Medical Applications Control system for wheelchairs (Soh & Demiris (2011)) Maximizing safety Maximizing speed Minimizing power consumption Partial observability (MOPOMDP) Finite-state controllers Evolutionary method Pareto coverage set Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
144 Environmental Applications Environmental control of a building (Kwak et al. (2012)) Minimizing energy consumption Maximizing comfort Reduce energy consumption by 30% w.r.t. old system While maintaining comfort Groups of people with different preferences Smart grids Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
145 Broader Application Probabilistic Planning is Multi-objective Bryce et al. (2007) The expected return is not enough Cost of a plan Probability of success of a plan Non-goal terminal states Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
146 Closing Consider multiple objectives: most problems have them, and a priori scalarization can be bad Derive your solution set: the full Pareto front is often not necessary Promising applications Whiteson & Roijers (UOx) Multi-Objective Decision Making July 25, / 1
Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 215) Computing Possibly Optimal Solutions for Multi-Objective Constraint Optimisation with Tradeoffs Nic
More informationAn Adaptive Clustering Method for Model-free Reinforcement Learning
An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at
More informationSequential decision making under uncertainty. Department of Computer Science, Czech Technical University in Prague
Sequential decision making under uncertainty Jiří Kléma Department of Computer Science, Czech Technical University in Prague /doku.php/courses/a4b33zui/start pagenda Previous lecture: individual rational
More informationLecture 1: March 7, 2018
Reinforcement Learning Spring Semester, 2017/8 Lecture 1: March 7, 2018 Lecturer: Yishay Mansour Scribe: ym DISCLAIMER: Based on Learning and Planning in Dynamical Systems by Shie Mannor c, all rights
More informationPlanning Under Uncertainty II
Planning Under Uncertainty II Intelligent Robotics 2014/15 Bruno Lacerda Announcement No class next Monday - 17/11/2014 2 Previous Lecture Approach to cope with uncertainty on outcome of actions Markov
More informationRL 14: POMDPs continued
RL 14: POMDPs continued Michael Herrmann University of Edinburgh, School of Informatics 06/03/2015 POMDPs: Points to remember Belief states are probability distributions over states Even if computationally
More informationMultiagent Value Iteration in Markov Games
Multiagent Value Iteration in Markov Games Amy Greenwald Brown University with Michael Littman and Martin Zinkevich Stony Brook Game Theory Festival July 21, 2005 Agenda Theorem Value iteration converges
More informationLecture 3: Markov Decision Processes
Lecture 3: Markov Decision Processes Joseph Modayil 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs Markov Processes Introduction Introduction to MDPs Markov
More informationLecture 25: Learning 4. Victor R. Lesser. CMPSCI 683 Fall 2010
Lecture 25: Learning 4 Victor R. Lesser CMPSCI 683 Fall 2010 Final Exam Information Final EXAM on Th 12/16 at 4:00pm in Lederle Grad Res Ctr Rm A301 2 Hours but obviously you can leave early! Open Book
More informationMachine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?
Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity
More informationLecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan
COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan Some slides borrowed from Peter Bodik and David Silver Course progress Learning
More informationLecture 3: The Reinforcement Learning Problem
Lecture 3: The Reinforcement Learning Problem Objectives of this lecture: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which
More informationMachine Learning I Reinforcement Learning
Machine Learning I Reinforcement Learning Thomas Rückstieß Technische Universität München December 17/18, 2009 Literature Book: Reinforcement Learning: An Introduction Sutton & Barto (free online version:
More informationReinforcement Learning
CS7/CS7 Fall 005 Supervised Learning: Training examples: (x,y) Direct feedback y for each input x Sequence of decisions with eventual feedback No teacher that critiques individual actions Learn to act
More informationReinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina
Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it
More informationReal Time Value Iteration and the State-Action Value Function
MS&E338 Reinforcement Learning Lecture 3-4/9/18 Real Time Value Iteration and the State-Action Value Function Lecturer: Ben Van Roy Scribe: Apoorva Sharma and Tong Mu 1 Review Last time we left off discussing
More informationSequential Decision Problems
Sequential Decision Problems Michael A. Goodrich November 10, 2006 If I make changes to these notes after they are posted and if these changes are important (beyond cosmetic), the changes will highlighted
More informationEfficient Maximization in Solving POMDPs
Efficient Maximization in Solving POMDPs Zhengzhu Feng Computer Science Department University of Massachusetts Amherst, MA 01003 fengzz@cs.umass.edu Shlomo Zilberstein Computer Science Department University
More informationReinforcement Learning. Yishay Mansour Tel-Aviv University
Reinforcement Learning Yishay Mansour Tel-Aviv University 1 Reinforcement Learning: Course Information Classes: Wednesday Lecture 10-13 Yishay Mansour Recitations:14-15/15-16 Eliya Nachmani Adam Polyak
More informationThe exam is closed book, closed calculator, and closed notes except your one-page crib sheet.
CS 188 Spring 2017 Introduction to Artificial Intelligence Midterm V2 You have approximately 80 minutes. The exam is closed book, closed calculator, and closed notes except your one-page crib sheet. Mark
More informationOn Prediction and Planning in Partially Observable Markov Decision Processes with Large Observation Sets
On Prediction and Planning in Partially Observable Markov Decision Processes with Large Observation Sets Pablo Samuel Castro pcastr@cs.mcgill.ca McGill University Joint work with: Doina Precup and Prakash
More informationCS 188 Introduction to Fall 2007 Artificial Intelligence Midterm
NAME: SID#: Login: Sec: 1 CS 188 Introduction to Fall 2007 Artificial Intelligence Midterm You have 80 minutes. The exam is closed book, closed notes except a one-page crib sheet, basic calculators only.
More informationLearning to Coordinate with Coordination Graphs in Repeated Single-Stage Multi-Agent Decision Problems
Learning to Coordinate with Coordination Graphs in Repeated Single-Stage Multi-Agent Decision Problems Eugenio Bargiacchi Timothy Verstraeten Diederik M. Roijers 2 Ann Nowé Hado van Hasselt 3 Abstract
More information16.410/413 Principles of Autonomy and Decision Making
16.410/413 Principles of Autonomy and Decision Making Lecture 23: Markov Decision Processes Policy Iteration Emilio Frazzoli Aeronautics and Astronautics Massachusetts Institute of Technology December
More informationChapter 3: The Reinforcement Learning Problem
Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which
More informationTowards Faster Planning with Continuous Resources in Stochastic Domains
Towards Faster Planning with Continuous Resources in Stochastic Domains Janusz Marecki and Milind Tambe Computer Science Department University of Southern California 941 W 37th Place, Los Angeles, CA 989
More information(Deep) Reinforcement Learning
Martin Matyášek Artificial Intelligence Center Czech Technical University in Prague October 27, 2016 Martin Matyášek VPD, 2016 1 / 17 Reinforcement Learning in a picture R. S. Sutton and A. G. Barto 2015
More informationReinforcement Learning and Deep Reinforcement Learning
Reinforcement Learning and Deep Reinforcement Learning Ashis Kumer Biswas, Ph.D. ashis.biswas@ucdenver.edu Deep Learning November 5, 2018 1 / 64 Outlines 1 Principles of Reinforcement Learning 2 The Q
More informationCourse basics. CSE 190: Reinforcement Learning: An Introduction. Last Time. Course goals. The website for the class is linked off my homepage.
Course basics CSE 190: Reinforcement Learning: An Introduction The website for the class is linked off my homepage. Grades will be based on programming assignments, homeworks, and class participation.
More informationMulti Objective Optimization
Multi Objective Optimization Handout November 4, 2011 (A good reference for this material is the book multi-objective optimization by K. Deb) 1 Multiple Objective Optimization So far we have dealt with
More informationAM 121: Intro to Optimization Models and Methods: Fall 2018
AM 11: Intro to Optimization Models and Methods: Fall 018 Lecture 18: Markov Decision Processes Yiling Chen Lesson Plan Markov decision processes Policies and value functions Solving: average reward, discounted
More informationHomework 2: MDPs and Search
Graduate Artificial Intelligence 15-780 Homework 2: MDPs and Search Out on February 15 Due on February 29 Problem 1: MDPs [Felipe, 20pts] Figure 1: MDP for Problem 1. States are represented by circles
More information