Preference Elicitation for Sequential Decision Problems

1 Preference Elicitation for Sequential Decision Problems Kevin Regan University of Toronto

2 Introduction 2 Motivation
Focus: Computational approaches to sequential decision making under uncertainty
These approaches require:
- A model of dynamics
- A model of rewards

3 Introduction 3 Motivation
Except in some simple cases, the specification of rewards is problematic:
- Preferences about which states/actions are good and bad need to be translated into precise numerical rewards
- It is time consuming to specify a reward for all states/actions
- Rewards can vary from user to user

4 Introduction 4 Motivation
The field of Preference Elicitation has a wide variety of approaches to specifying utility for single-step decision making.
There has been comparatively little work on extending these approaches to multi-step decision making.

5 Outline 1. Single-step Decision Making 2. Sequential Decision Making 3. Sequential Model Uncertainty 4. Eliciting Sequential Preferences 5. Proposed Research

6 Outline 1. Single-step Decision Making 2. Sequential Decision Making 3. Sequential Model Uncertainty 4. Eliciting Sequential Preferences 5. Proposed Research

7 Single-step Decision Making 7 Decision Theory
Decision theory provides a framework for modeling the preferences of a user and stipulates how optimal decisions are to be made based on these preferences.
Given:
- A set of possible outcomes $X = \{X_1, X_2, \ldots, X_n\}$
- A utility function $U : X \to \mathbb{R}$
The utility function can often encode independence assumptions derived from a domain.
However:
- There are often a large number of outcomes
- Specifying a utility for each outcome is problematic

[Figure: the decision problem and the utility function feed into Compute Decision]

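8 Single-step Decision Making 8 Preference Elicitation
Specify the utility incrementally
[Figure: elicitation loop. The decision problem and the current utility information feed into Compute Decision, which produces a decision and a decision measure. If the user is satisfied, the process is done. If not, Select Query sends a query to the user, and the response updates the utility information.]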

9 Single-step Decision Making 9 Partial Preferences - Strict Uncertainty
Strict uncertainty is represented by a feasible utility set $U$.
maximin: $x_U[\mathrm{MMN}] = \arg\max_{x \in X} \min_{u \in U} u(x)$
minimax regret: $x_U[\mathrm{MMR}] = \arg\min_{x \in X} \max_{x' \in X} \max_{u \in U} [\, u(x') - u(x) \,]$
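The two criteria on this slide can be sketched in a few lines when the feasible utility set is a small finite list. This is only an illustrative sketch: the names `maximin`, `max_regret`, `minimax_regret`, `OUTCOMES` and `FEASIBLE`, and all numbers, are assumptions made up for the example and do not come from the talk.

```python
from typing import Dict, List

Utility = Dict[str, float]  # one candidate utility function: outcome -> value


def maximin(outcomes: List[str], feasible: List[Utility]) -> str:
    """Outcome whose worst-case utility over the feasible set is largest."""
    return max(outcomes, key=lambda x: min(u[x] for u in feasible))


def max_regret(x: str, outcomes: List[str], feasible: List[Utility]) -> float:
    """Largest loss u(x') - u(x) over all alternatives x' and feasible u."""
    return max(u[alt] - u[x] for alt in outcomes for u in feasible)


def minimax_regret(outcomes: List[str], feasible: List[Utility]) -> str:
    """Outcome with the smallest max regret."""
    return min(outcomes, key=lambda x: max_regret(x, outcomes, feasible))


# Hypothetical data: three outcomes, two utility functions still feasible.
OUTCOMES = ["a", "b", "c"]
FEASIBLE = [
    {"a": 0.5, "b": 0.4, "c": 0.0},
    {"a": 0.5, "b": 1.0, "c": 0.9},
]
```

With these numbers maximin picks "a" (its worst case is 0.5), while minimax regret picks "b" (it can lose at most about 0.1, whereas "a" can lose 0.5), which illustrates why minimax regret is often described as less conservative.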

10 Single-step Decision Making 10 Partial Preferences - Bayesian Uncertainty
Given a prior $\sigma$ over utility functions $u \in U$:
expected utility: $x_U[\mathrm{EU}] = \arg\max_{x \in X} \mathbb{E}_{\sigma}[\, u(x) \,]$
percentile criterion: $x_U[\mathrm{VAR}] = \arg\max_{x \in X} \max_{y} y$ subject to $\Pr_{\sigma}(u(x) \ge y) \ge \eta$
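A minimal sketch of both Bayesian criteria, assuming the prior is a finite list of weighted utility functions (an assumption made for the example; the talk does not fix a representation). The function names and the numbers are hypothetical.

```python
from typing import Dict, List, Tuple

# A prior as (utility function, probability) pairs; probabilities sum to 1.
Prior = List[Tuple[Dict[str, float], float]]


def expected_utility(x: str, prior: Prior) -> float:
    return sum(p * u[x] for u, p in prior)


def percentile_value(x: str, prior: Prior, eta: float) -> float:
    """Largest y such that Pr(u(x) >= y) >= eta under the prior."""
    ranked = sorted(((u[x], p) for u, p in prior), reverse=True)
    mass = 0.0
    for value, p in ranked:
        mass += p
        if mass >= eta - 1e-12:  # small slack for floating-point sums
            return value
    return ranked[-1][0]


def best_by_expected_utility(outcomes: List[str], prior: Prior) -> str:
    return max(outcomes, key=lambda x: expected_utility(x, prior))


def best_by_percentile(outcomes: List[str], prior: Prior, eta: float) -> str:
    return max(outcomes, key=lambda x: percentile_value(x, prior, eta))


# Hypothetical prior over two utility functions.
PRIOR = [
    ({"a": 0.5, "b": 1.0}, 0.7),
    ({"a": 0.4, "b": 0.0}, 0.3),
]
```

Here expected utility prefers "b" (0.7 against 0.47), but at confidence level 0.9 the percentile criterion prefers "a", because "b" only guarantees a utility of 0.0 with that much probability.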

11 Single-step Decision Making 11 Query Types
[Axis: Cognitive Ease to Information Gain]
Comparison: Do you prefer x to y?
Ranking: Please rank the following set of k outcomes...

12 Single-step Decision Making 12 Query Selection
In order to choose queries we look at the value of the potential responses.
For strict uncertainty, value corresponds to:
- Reducing Uncertainty [I05,T03,T04]
- Reducing Regret [B05]
For Bayesian uncertainty, value corresponds to:
- Expected Value of Information [B02,C02,H03]

13 Outline 1. Single-step Decision Making 2. Sequential Decision Making 3. Sequential Model Uncertainty 4. Eliciting Sequential Preferences 5. Proposed Research

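14 Markov Decision Processes 14 The Markov Decision Process
$S$ - Set of States
$A$ - Set of Actions
$\Pr(s' \mid s, a)$ - Transitions
$\gamma$ - Discount Factor
$r(s)$ - Reward [or $r(s, a)$]
[Figure: the AGENT sends Actions to the WORLD and the WORLD returns States; a trajectory $s_t, s_{t+1}, s_{t+2}$ with actions $a_t, a_{t+1}$ and rewards $r_t, r_{t+1}, r_{t+2}$]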

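15 Markov Decision Processes 15 Policies
Policy: A (stationary) policy $\pi$ maps each state to an action.
Policy Value: Given a policy $\pi$, the value of a state is
$V^{\pi}(s_0) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t r_t \mid \pi, s_0 \right]$
Bellman Equation:
$V^{\pi^*}(s) = \max_a \left[ r(s,a) + \gamma \sum_{s'} \Pr(s' \mid s,a) \, V^{\pi^*}(s') \right]$, where the maximizing action is $a_{\pi^*} = \pi^*(s)$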
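The policy value defined above can be approximated by repeatedly applying the fixed-policy backup $V(s) \leftarrow r(s, \pi(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi(s)) V(s')$. The sketch below assumes a tabular MDP stored as nested lists; the function name `evaluate_policy` and the two-state example are hypothetical, chosen so the answer can be checked by hand.

```python
from typing import List

# P[s][a][s2] = Pr(s2 | s, a); r[s][a] = reward for taking action a in state s.
Transitions = List[List[List[float]]]
Rewards = List[List[float]]


def evaluate_policy(P: Transitions, r: Rewards, policy: List[int],
                    gamma: float, tol: float = 1e-10) -> List[float]:
    """Iterate the fixed-policy Bellman backup until the values stop changing."""
    n_states = len(P)
    V = [0.0] * n_states
    while True:
        new_V = []
        for s in range(n_states):
            a = policy[s]
            expected_next = sum(p * V[s2] for s2, p in enumerate(P[s][a]))
            new_V.append(r[s][a] + gamma * expected_next)
        if max(abs(x - y) for x, y in zip(new_V, V)) < tol:
            return new_V
        V = new_V


# Hypothetical two-state MDP: "stay" keeps the state, "move" switches it.
STAY, MOVE = 0, 1
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # from state 0: stay -> 0, move -> 1
    [[0.0, 1.0], [1.0, 0.0]],  # from state 1: stay -> 1, move -> 0
]
R = [[0.0, 0.0], [1.0, 1.0]]  # state 1 pays 1 per step, state 0 pays 0
V = evaluate_policy(P, R, policy=[MOVE, STAY], gamma=0.9)
```

For the policy "move to state 1, then stay", the closed form gives $V(1) = 1/(1-0.9) = 10$ and $V(0) = 0.9 \cdot 10 = 9$, which the iteration reproduces up to the tolerance.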

16 Markov Decision Processes 16 Computing Optimal Policies
Value Iteration [Bellman 1966]
Given an initial value function, repeated backups will converge to the optimal value function
Policy Iteration [Howard 1960]
1. Policy evaluation: finds the value of the current policy
2. Policy improvement: performs one backup and finds the best policy
Linear Programming [Puterman 1994]
Encodes Bellman's equation using $|S|$ variables and $|S||A|$ constraints
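As an illustration of the first method, here is a minimal tabular value iteration sketch. The data layout (`P[s][a][s2]`, `r[s][a]`), the function name and the toy MDP are assumptions for the example, not part of the talk.

```python
from typing import List, Tuple


def value_iteration(P: List[List[List[float]]], r: List[List[float]],
                    gamma: float, tol: float = 1e-10
                    ) -> Tuple[List[float], List[int]]:
    """Repeat Bellman backups until convergence; return values and a greedy policy."""
    n_states, n_actions = len(P), len(P[0])
    V = [0.0] * n_states
    while True:
        # Q[s][a] = r(s, a) + gamma * sum_s2 Pr(s2 | s, a) * V(s2)
        Q = [[r[s][a] + gamma * sum(p * V[s2] for s2, p in enumerate(P[s][a]))
              for a in range(n_actions)]
             for s in range(n_states)]
        new_V = [max(q) for q in Q]
        if max(abs(x - y) for x, y in zip(new_V, V)) < tol:
            policy = [q.index(max(q)) for q in Q]  # first maximizing action
            return new_V, policy
        V = new_V


# Hypothetical two-state MDP: action 0 stays, action 1 moves to the other state.
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
R = [[0.0, 0.0], [1.0, 1.0]]  # only state 1 is rewarding
V_OPT, PI_OPT = value_iteration(P, R, gamma=0.9)
```

The optimal policy moves out of state 0 and stays in state 1, giving values close to 9 and 10.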

17 Markov Decision Processes 17 Scaling
Abstraction [BDG95, DG97, BDH99]
States that have the same optimal action or the same value are grouped together and treated as one
Decomposition [M98, SC98, BDH99]
The MDP is split into a set of smaller sub-MDPs that are solved independently, and the locally optimal policies are combined to form an approximate global policy
Approximation [SP01,P02,G03]
The value function is approximated by a lower-dimensional linear combination of basis functions

18 Outline 1. Single-step Decision Making 2. Sequential Decision Making 3. Sequential Model Uncertainty 4. Eliciting Sequential Preferences 5. Proposed Research

19 Model Uncertainty 19 Robust MDPs [Bagnell et al. 2001, Iyengar 2005, Nilim & Ghaoui 2005]
Unknown model parameters: Transitions
Decision criterion: Maximin
Use a dynamic programming approach to compute the maximin-optimal action at each time step
$\pi^* = \arg\max_{\pi} \min_{P \in \mathcal{P}} \mathbb{E}_{x \sim P, \pi} \left[ \sum_t \gamma^t R(x_t) \right]$
$Q_t(s,a) = \min_{P \in \mathcal{P}} \left[ r(s) + \gamma \sum_{s'} P(s' \mid s,a) \, V_{t-1}(s') \right]$
$V_t(s) = \max_a Q_t(s,a)$
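The robust backup above can be sketched for the simplest case, where the uncertainty set is a finite list of candidate next-state distributions for each state-action pair and the adversary chooses independently per pair. That finite, per-pair structure is an assumption made for this example; the cited papers work with richer uncertainty sets. All names and numbers are hypothetical.

```python
from typing import List, Tuple

# candidates[s][a] is a list of possible next-state distributions for (s, a).
Candidates = List[List[List[List[float]]]]


def robust_value_iteration(candidates: Candidates, r: List[float],
                           gamma: float, horizon: int
                           ) -> Tuple[List[float], List[int]]:
    """Apply Q_t(s,a) = min_P [r(s) + gamma * sum P(s2|s,a) V_{t-1}(s2)], V_t = max_a Q_t."""
    n_states = len(candidates)
    V = [0.0] * n_states
    policy = [0] * n_states
    for _ in range(horizon):
        new_V = []
        for s in range(n_states):
            q_values = []
            for dists in candidates[s]:
                # The adversary picks the distribution with the lowest expected value.
                worst = min(sum(p * V[s2] for s2, p in enumerate(d)) for d in dists)
                q_values.append(r[s] + gamma * worst)
            best = max(q_values)
            policy[s] = q_values.index(best)
            new_V.append(best)
        V = new_V
    return V, policy


# Hypothetical example: state 1 is absorbing and pays 1 per step.
# In state 0, action 0 ("risky") has two candidate models, action 1 ("steady") has one.
CANDIDATES = [
    [[[0.2, 0.8], [0.9, 0.1]], [[0.5, 0.5]]],
    [[[0.0, 1.0]], [[0.0, 1.0]]],
]
REWARD = [0.0, 1.0]
V_ROBUST, PI_ROBUST = robust_value_iteration(CANDIDATES, REWARD, gamma=0.9, horizon=500)
```

Under the optimistic model the risky action would look better, but its worst case is poor, so the maximin policy chooses the steady action in state 0.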

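20 Model Uncertainty 20 Robust MDPs [McMahan, Gordon & Blum 2005]
Unknown model parameters: Rewards
Decision criterion: Maximin
Use a linear programming approach with constraint generation
$\pi^* = \arg\max_{\pi} \min_{R \in \mathcal{R}} \mathbb{E}_{x \sim \pi} \left[ \sum_t \gamma^t R(x_t) \right]$
maximize over $\delta, \pi$: $\delta$
subject to: $\delta \le V^{\pi}_{R} \quad \forall R \in \mathcal{R}$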

21 Model Uncertainty 21 Robust MDPs [Delage & Mannor 2007]
Unknown model parameters: Transitions & Rewards
Decision criterion: Percentile Criterion
Solve for reward uncertainty in the form of a Gaussian as a second-order cone program (SOCP)
Give an approximation for transition uncertainty in the form of Dirichlets
maximize over $\pi, y$: $y$
subject to: $\Pr\left( \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t r_t(x_t) \mid \pi \right] \ge y \right) \ge \eta$

22 Model Uncertainty 22 Other Approaches
Reinforcement Learning [KS02, BT03] --- Bayesian [D02, P06]
Assumes the transitions (and rewards) are not known beforehand. In Bayesian RL we have a prior over transitions and rewards. This is an online model.
Inverse Reinforcement Learning [NR00]
Assumes the reward function is unknown, but that we have examples of the optimal policy being executed. The aim is to find the best reward function.

23 Outline 1. Single-step Decision Making 2. Sequential Decision Making 3. Sequential Model Uncertainty 4. Eliciting Sequential Preferences 5. Proposed Research

24 Elicitation 24 Policy Teaching & Bayesian RL
Reinforcement Learning [KS02, BT03] --- Bayesian [D02, P06]
Actions yield observations of the transition and reward functions. Actions are chosen to balance the explore/exploit tradeoff. This learning happens online.
Policy Teaching [ZP08, ZPC09]
A hidden reward function is learned by adding incentives to it and observing the resulting behaviour

25 Elicitation 25 Robust MDPs
Reference         Uncertainty   Measure      Elicitation?
[B01,I05,NG05]    Transitions   Maximin
[MGB05]           Reward        Maximin
[DM07]            Reward        Percentile   Approximates myopic EVOI; uses equivalence queries: What is r(s,a)?

26 Elicitation 26 Robust MDPs
Reference         Uncertainty   Measure      Elicitation?
[B01,I05,NG05]    Transitions   Maximin
[MGB05]           Reward        Maximin
[DM07]            Reward        Percentile
??

27 Outline 1. Single-step Decision Making 2. Sequential Decision Making 3. Sequential Model Uncertainty 4. Eliciting Sequential Preferences 5. Proposed Research

28 Directions 28 Summary
Our goal is to efficiently elicit reward functions for Markov decision problems. To reach this goal we must focus on:
1. Developing effective methods for computing good (robust) policies given reward uncertainty
2. Developing reward queries that are conceptually tractable and computationally efficient
3. Developing strategies to select queries to quickly produce better policies
[Figure: elicitation loop. The MDP and the current reward information feed into Compute Decision, which produces a decision and a decision measure. If the user is satisfied, the process is done. If not, Select Query sends a query to the user, and the response updates the reward information.]

29 Directions 29 Computing Robust Policies
The Minimax Regret Criterion can be applied to computing policies
$\arg\min_{\pi} \max_{\pi'} \max_{r \in \mathcal{R}} \left( V^{\pi'}_{r} - V^{\pi}_{r} \right)$
It offers a number of desirable properties:
1. Offers a (non-probabilistic) guarantee
2. Less conservative than maximin
3. The relative comparison between current choice and best possible choice offers an intuitive measure
Ongoing work has developed several novel approaches to computing Minimax Regret for Markov decision processes
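To make the criterion concrete, the sketch below brute-forces minimax regret on a tiny MDP. It is not one of the methods from the talk: it assumes a finite list of candidate reward functions and searches only deterministic stationary policies, whereas the true minimax-regret policy may be randomized. Function names, the start distribution `start` and all numbers are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple


def policy_value(P, r, policy: Sequence[int], gamma: float,
                 start: List[float], tol: float = 1e-10) -> float:
    """Expected discounted value of a deterministic policy from the start distribution."""
    n = len(P)
    V = [0.0] * n
    while True:
        new_V = [r[s][policy[s]] + gamma * sum(p * V[s2] for s2, p in enumerate(P[s][policy[s]]))
                 for s in range(n)]
        done = max(abs(x - y) for x, y in zip(new_V, V)) < tol
        V = new_V
        if done:
            return sum(w * v for w, v in zip(start, V))


def minimax_regret_policy(P, rewards, gamma: float,
                          start: List[float]) -> Tuple[Tuple[int, ...], float]:
    """Enumerate deterministic policies and return the one with the smallest max regret."""
    n_states, n_actions = len(P), len(P[0])
    policies = list(product(range(n_actions), repeat=n_states))
    values = {(pi, k): policy_value(P, r, pi, gamma, start)
              for pi in policies for k, r in enumerate(rewards)}
    # Best achievable value under each candidate reward (the adversary's policy).
    best = [max(values[(pi, k)] for pi in policies) for k in range(len(rewards))]

    def max_regret(pi):
        return max(best[k] - values[(pi, k)] for k in range(len(rewards)))

    chosen = min(policies, key=max_regret)
    return chosen, max_regret(chosen)


# Hypothetical two-state MDP: action 0 stays, action 1 moves to the other state.
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
# Two candidate rewards r[s][a]; they disagree on how good staying in state 1 is.
REWARDS = [
    [[0.5, 0.0], [0.4, 0.0]],
    [[0.5, 0.0], [1.0, 0.0]],
]
PI_MMR, REGRET = minimax_regret_policy(P, REWARDS, gamma=0.9, start=[1.0, 0.0])
```

In this example a maximin policy would stay in state 0 (guaranteed value 5), while the minimax-regret policy moves to state 1 and stays there, accepting a regret of at most about 1.4 instead of 4.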

30 Directions 30 Computing Robust Policies
Ongoing work has developed several novel approaches to computing Minimax Regret for Markov decision processes:
- Exact formulations using linear and mixed integer programming with constraint generation [NIPS 08]
- Precomputation of non-dominated policies
- Factored MDPs
- Approximations [UAI 09]

31 Directions 31 Reward Queries
With respect to individual reward points, we can use many of the query types developed for single-step decision making
Bounding: is $r(s,a) \ge b$?
Comparison: is $r(s,a) \ge r(s',a')$?
There is potential for queries which are sequential in nature
Policy: is $V^{\pi} \ge V^{\pi'}$?
Trajectory: is $s_1, a_1, \ldots, s_{k-1}, a_{k-1}, s_k \succeq s'_1, a'_1, \ldots, s'_{k-1}, a'_{k-1}, s'_k$?
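A bounding query is easy to turn into an update when reward uncertainty is kept as an interval per state-action pair. The sketch below assumes that interval representation and a simulated user who knows the true reward; both are assumptions for illustration, as are the function names.

```python
from typing import Dict, Tuple

Point = Tuple[int, int]                      # (state, action)
Bounds = Dict[Point, Tuple[float, float]]    # (lower, upper) bound on r(s, a)


def answer_bound_query(true_reward: Dict[Point, float], point: Point, b: float) -> bool:
    """Simulated user: answers whether the true reward at this point is at least b."""
    return true_reward[point] >= b


def apply_bound_response(bounds: Bounds, point: Point, b: float, says_yes: bool) -> Bounds:
    """Tighten the interval for one reward point; all other points are unchanged."""
    lower, upper = bounds[point]
    updated = dict(bounds)
    if says_yes:
        updated[point] = (max(lower, b), upper)   # reward is at least b
    else:
        updated[point] = (lower, min(upper, b))   # reward is below b
    return updated


# Hypothetical example with a single uncertain reward point.
BOUNDS = {(0, 0): (0.0, 1.0)}
TRUE_REWARD = {(0, 0): 0.7}
reply = answer_bound_query(TRUE_REWARD, (0, 0), 0.5)
TIGHTER = apply_bound_response(BOUNDS, (0, 0), 0.5, reply)
```

Each response removes part of the feasible reward set, which can only lower the max regret of any policy.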

32 Directions 32 Summary
Computing Minimax Regret
- Exact Methods
  - Using MIP + constraint generation [UAI 09, NIPS 08]
  - Using non-dominated policies [In progress]
- Approximations [In progress]
Reward Queries
- Bound queries [UAI 09, NIPS 08]
- Richer (sequential) queries [Future work]
Query Selection
- Volumetric [Tech Report]
- Regret Based [UAI 09]

33 Thank you

34 Conclusion 34 Future Work
Richer Queries
Do you prefer tradeoff
$f(s_2, a_3) = f_1$ amount of time doing $(s_2, a_3)$ and $f(s_1, a_4) = f_2$ amount of time doing $(s_1, a_4)$
or
$f(s_2, a_3) = f_1'$ amount of time doing $(s_2, a_3)$ and $f(s_1, a_4) = f_2'$ amount of time doing $(s_1, a_4)$?
[Figure: the two tradeoffs $(f_1, f_2)$ and $(f_1', f_2')$ side by side. In each, the first quantity is labelled s = No Street Car, a = Waiting and the second is labelled s = Cab Available, a = Take Cab.]

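35 Appendix 35 Full Formulation
Master
minimize over $f, \delta$: $\delta$   (8)
subject to:
  $r \cdot g - r \cdot f \le \delta \quad \forall g \in F, \; r \in \mathcal{R}$
  $\gamma E f + \alpha = 0$
Subproblem
maximize over $Q, V, I, r$: $\alpha \cdot V - r \cdot f$   (9)
subject to:
  $Q_a = r_a + \gamma P_a V \quad \forall a \in A$
  $V \ge Q_a \quad \forall a \in A$   (10)
  $V \le (1 - I_a) M_a + Q_a \quad \forall a \in A$   (11)
  $C r \le d$
  $\sum_a I_a = 1$   (12)
  $I_a(s) \in \{0, 1\} \quad \forall a, s$   (13)
  $M_a = M^{\top} - M_a^{\perp}$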

36 Computation 36 Approximating Minimax Regret
We relax the Max Regret MIP formulation.
The value of the resulting policy is no longer exact; however, the resulting reward is still feasible.
We find the optimal policy w.r.t. the resulting reward.

37 Computation 37 Scaling (Log Scale)

38 Evaluation 38 Experimental Setup
Randomly generated MDPs
- 10 states, 5 actions
- Semi-sparse random transition function, discount factor of 0.95
- Random true reward drawn from a fixed interval; upper and lower bounds on reward drawn randomly
All results are averaged over 20 runs

39 Evaluation 39 Elicitation Effectiveness
We examine the combination of each criterion for robust policies with each of the elicitation strategies
Criteria for robust policies: Minimax Regret (MMR), Maximin Regret (MR)
Elicitation strategies: Halve the Largest Gap (HLG), Current Solution (CS)
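A minimal sketch of the Halve the Largest Gap idea, under the assumption that it works as the name suggests: pick the reward point whose upper and lower bounds are furthest apart and ask a bound query at the midpoint. The interval representation, the simulated user and the function name are assumptions for the example; the talk does not spell out the procedure.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]                      # (state, action)
Bounds = Dict[Point, Tuple[float, float]]    # (lower, upper) bound on r(s, a)


def halve_largest_gap(bounds: Bounds, true_reward: Dict[Point, float],
                      n_queries: int) -> Tuple[Bounds, List[Tuple[Point, float]]]:
    """Ask n_queries midpoint bound queries, always at the point with the widest interval."""
    bounds = dict(bounds)  # work on a copy
    asked = []
    for _ in range(n_queries):
        point = max(bounds, key=lambda sa: bounds[sa][1] - bounds[sa][0])
        lower, upper = bounds[point]
        midpoint = (lower + upper) / 2
        # Simulated user answers "is r(s, a) >= midpoint?" from the true reward.
        if true_reward[point] >= midpoint:
            bounds[point] = (midpoint, upper)
        else:
            bounds[point] = (lower, midpoint)
        asked.append((point, midpoint))
    return bounds, asked


# Hypothetical example with two uncertain reward points.
INITIAL = {(0, 0): (0.0, 1.0), (0, 1): (0.2, 0.4)}
TRUE_REWARD = {(0, 0): 0.7, (0, 1): 0.3}
FINAL, ASKED = halve_largest_gap(INITIAL, TRUE_REWARD, n_queries=3)
```

In this run all three queries go to the point (0, 0), whose interval shrinks from width 1.0 to 0.125, while the narrower point is left alone. The strategy ignores how much a point matters to the policy, which is the contrast with a strategy driven by the current solution.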

40 Evaluation 40 Max Regret - Random MDP
[Figure: plot with axis label Max Regret]

41 Evaluation 41 True Regret (Loss) - Random MDP
[Figure: plot with axis label True Regret]

42 Evaluation 42 Maximin Value - Random MDP
[Figure: plot with axis label Maximin Value]

43 Evaluation 43 Queries per Reward Point - Random MDP
Most of the reward space is unexplored
We repeatedly query a small set of high impact reward points

44 Evaluation 44 Autonomic Computing
Setup: 2 Hosts, 3 Demand levels, 3 Units of Resource
Model: 90 States, 10 Actions
[Figure: Host 1 through Host k, each with its own Demand and Resource, drawing on a shared Total Resource]

45 Evaluation 45 Max Regret - Autonomic Computing
[Figure: Queries vs. Max Regret; series: Maximin, Minimax Regret; axes: Queries, Max Regret]

46 Evaluation 46 True Regret (Loss) - Autonomic Computing
[Figure: Queries vs. True Regret; series: Maximin, Minimax Regret; axes: Queries, True Regret]
