Preference Elicitation for Sequential Decision Problems
1 Preference Elicitation for Sequential Decision Problems Kevin Regan University of Toronto
2 Introduction: Motivation
Focus: computational approaches to sequential decision making under uncertainty.
These approaches require:
- A model of dynamics
- A model of rewards
3 Introduction: Motivation
Except in some simple cases, the specification of rewards is problematic:
- Preferences about which states/actions are good and bad need to be translated into precise numerical rewards
- It is time-consuming to specify a reward for all states/actions
- Rewards can vary from user to user
4 Introduction: Motivation
The field of preference elicitation offers a wide variety of approaches to specifying utility for single-step decision making. There has been comparatively little work on extending these approaches to multi-step decision making.
5 Outline 1. Single-step Decision Making 2. Sequential Decision Making 3. Sequential Model Uncertainty 4. Eliciting Sequential Preferences 5. Proposed Research
7 Single-step Decision Making: Decision Theory
Decision theory provides a framework for modeling the preferences of a user and stipulates how optimal decisions are to be made based on these preferences.
Given:
- A set of possible outcomes $X = \{x_1, x_2, \ldots, x_n\}$
- A utility function $u : X \to \mathbb{R}$
The utility function can often encode independence assumptions derived from the domain.
However:
- There are often a large number of outcomes
- Specifying a utility for each outcome is problematic
8 Single-step Decision Making: Preference Elicitation
Specify the utility incrementally.
[Figure: elicitation loop with boxes Decision Problem, Utility, Compute Decision, Satisfied? (yes: Done; no: Select Query) and User; arrows labeled decision, measure, query and response.]
9 Single-step Decision Making: Partial Preferences - Strict Uncertainty
Strict uncertainty is represented by a feasible utility set $U$.
Maximin:
$x_U^{\mathrm{MMN}} = \arg\max_{x \in X} \min_{u \in U} u(x)$
Minimax regret:
$x_U^{\mathrm{MMR}} = \arg\min_{x \in X} \max_{x' \in X} \max_{u \in U} \left[ u(x') - u(x) \right]$
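The two criteria can be compared on a small example. The sketch below assumes the feasible set $U$ is a finite list of utility vectors (in general it may be a continuous region); the outcome names and numbers are hypothetical and chosen so that the two criteria disagree.

```python
# Hypothetical toy data: each row is one feasible utility function u in U,
# each column is one outcome x in X.
OUTCOMES = ["x1", "x2", "x3"]
FEASIBLE_UTILITIES = [
    [0.4, 0.3, 0.0],
    [0.4, 1.0, 0.2],
]


def maximin(outcomes, utilities):
    """Outcome with the best worst-case utility over the feasible set."""
    def worst_case(i):
        return min(u[i] for u in utilities)
    best = max(range(len(outcomes)), key=worst_case)
    return outcomes[best], worst_case(best)


def max_regret(i, utilities):
    """Largest loss of outcome i against the best outcome, over all feasible u."""
    return max(max(u) - u[i] for u in utilities)


def minimax_regret(outcomes, utilities):
    """Outcome whose maximum regret is smallest."""
    best = min(range(len(outcomes)), key=lambda i: max_regret(i, utilities))
    return outcomes[best], max_regret(best, utilities)


MAXIMIN_CHOICE = maximin(OUTCOMES, FEASIBLE_UTILITIES)
MMR_CHOICE = minimax_regret(OUTCOMES, FEASIBLE_UTILITIES)
```

On this data maximin selects the outcome with a guaranteed utility of 0.4, while minimax regret selects the outcome that is never far from the best choice under either feasible utility function.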
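10 Single-step Decision Making: Partial Preferences - Bayesian Uncertainty
Given a prior $\sigma$ over utility functions:
Expected utility:
$x_U^{\mathrm{EU}} = \arg\max_{x \in X} E_{u \sim \sigma}\left[ u(x) \right]$
Percentile criterion:
$x_U^{\mathrm{VAR}} = \arg\max_{x \in X} \max_{y} \; y \quad \text{subject to} \quad \Pr_{u \sim \sigma}\left( u(x) \ge y \right) \ge \eta$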
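A minimal sketch of both Bayesian criteria, assuming the prior is a finite list of weighted utility vectors (a hypothetical stand-in for a general prior $\sigma$). For the percentile criterion, the largest $y$ with $\Pr(u(x) \ge y) \ge \eta$ is found by walking the utility values of $x$ from highest to lowest until the accumulated probability reaches $\eta$.

```python
# Hypothetical discrete prior: (probability, utility vector over outcomes x0, x1).
PRIOR = [
    (0.5, [1.0, 0.6]),
    (0.3, [0.8, 0.6]),
    (0.2, [0.0, 0.5]),
]


def expected_utility_choice(prior):
    """Index and value of the outcome with the highest expected utility."""
    n_outcomes = len(prior[0][1])
    eu = [sum(p * u[i] for p, u in prior) for i in range(n_outcomes)]
    best = max(range(n_outcomes), key=eu.__getitem__)
    return best, eu[best]


def percentile_value(prior, i, eta):
    """Largest y such that Pr(u(x_i) >= y) >= eta under the discrete prior."""
    mass = 0.0
    for p, u in sorted(prior, key=lambda pu: pu[1][i], reverse=True):
        mass += p
        if mass >= eta - 1e-12:  # small tolerance for floating-point sums
            return u[i]
    raise ValueError("prior probabilities should sum to at least eta")


def percentile_choice(prior, eta):
    """Index and guaranteed level of the outcome that is best at confidence eta."""
    n_outcomes = len(prior[0][1])
    best = max(range(n_outcomes), key=lambda i: percentile_value(prior, i, eta))
    return best, percentile_value(prior, best, eta)


EU_CHOICE = expected_utility_choice(PRIOR)
PERCENTILE_CHOICE = percentile_choice(PRIOR, eta=0.9)
```

With this prior, expected utility favours the first outcome, while the percentile criterion at $\eta = 0.9$ favours the second because it avoids the low-probability utility of 0.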
11 Single-step Decision Making: Query Types
Query types trade off cognitive ease against information gain:
- Comparison: Do you prefer x to y?
- Ranking: Please rank the following set of k outcomes...
12 Single-step Decision Making: Query Selection
To choose queries, we look at the value of the potential responses.
For strict uncertainty, value corresponds to:
- Reducing uncertainty [I05,T03,T04]
- Reducing regret [B05]
For Bayesian uncertainty, value corresponds to:
- Expected value of information [B02,C02,H03]
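14 Markov Decision Processes: The Markov Decision Process
- $S$: set of states
- $A$: set of actions
- $\Pr(s' \mid a, s)$: transitions
- $\gamma$: discount factor
- $r(s)$: reward [or $r(s,a)$]
[Figure: AGENT sends actions to WORLD and WORLD returns states; a trajectory $s_t, s_{t+1}, s_{t+2}$ with actions $a_t, a_{t+1}$ and rewards $r_t, r_{t+1}, r_{t+2}$.]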
15 Markov Decision Processes: Policies
Policy: a (stationary) policy $\pi$ maps each state to an action.
Policy value: given a policy $\pi$, the value of a state is
$V^{\pi}(s_0) = E\left[ \sum_{t=0}^{\infty} \gamma^t r_t \mid \pi, s_0 \right]$
Bellman equation:
$V^{\pi^*}(s) = \max_{a} \left[ r(s,a) + \gamma \sum_{s'} \Pr(s' \mid s,a) \, V^{\pi^*}(s') \right]$
16 Markov Decision Processes: Computing Optimal Policies
Value Iteration [Bellman 1966]
Given an initial value function, repeated backups will converge to the optimal value function.
Policy Iteration [Howard 1960]
1. Policy evaluation: finds the value of the current policy
2. Policy improvement: performs one backup and finds the best policy
Linear Programming [Puterman 1994]
Encodes Bellman's equation using $|S|$ variables and $|S||A|$ constraints.
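As an illustration of the first of these methods, here is a minimal value iteration sketch on a hypothetical two-state MDP. The data layout (`transitions[s][a]` as a list of `(probability, next_state)` pairs, `rewards[s][a]` as a number) and the stopping tolerance are assumptions made for this example.

```python
def value_iteration(transitions, rewards, gamma, tol=1e-8):
    """Repeat Bellman backups until the value function stops changing."""
    n_states = len(transitions)
    V = [0.0] * n_states
    while True:
        # One backup: Q(s,a) = r(s,a) + gamma * sum_s' Pr(s'|s,a) V(s')
        Q = [[rewards[s][a] + gamma * sum(p * V[s2] for p, s2 in transitions[s][a])
              for a in range(len(transitions[s]))]
             for s in range(n_states)]
        new_V = [max(q) for q in Q]
        delta = max(abs(x - y) for x, y in zip(new_V, V))
        V = new_V
        if delta < tol:
            policy = [max(range(len(q)), key=q.__getitem__) for q in Q]
            return V, policy


# Hypothetical MDP: in state 0, action 0 stays and action 1 moves to state 1;
# in state 1, action 0 stays (reward 1) and action 1 moves back to state 0.
TRANSITIONS = [
    [[(1.0, 0)], [(1.0, 1)]],
    [[(1.0, 1)], [(1.0, 0)]],
]
REWARDS = [
    [0.0, 0.0],
    [1.0, 0.0],
]

VALUES, POLICY = value_iteration(TRANSITIONS, REWARDS, gamma=0.9)
```

For this MDP the optimal policy moves to state 1 and stays there, so the values approach $1 / (1 - 0.9) = 10$ in state 1 and $0.9 \times 10 = 9$ in state 0.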
17 Markov Decision Processes: Scaling
Abstraction [BDG95, DG97, BDH99]
Groups together, and treats as one, any states that have the same optimal action or the same value.
Decomposition [M98, SC98, BDH99]
A set of smaller sub-MDPs is solved independently, and the locally optimal policies are combined to form an approximate global policy.
Approximation [SP01,P02,G03]
The value function is approximated by a lower-dimensional linear combination of basis functions.
19 Model Uncertainty: Robust MDPs [Bagnell et al. 2001, Iyengar 2005, Nilim & Ghaoui 2005]
Unknown model parameters: transitions
Decision criterion: maximin
Use a dynamic programming approach to compute the minimax optimal action at each time step:
$\pi^* = \arg\max_{\pi} \min_{P \in \mathcal{P}} E_{x \sim P, \pi}\left[ \sum_t \gamma^t R(x_t) \right]$
$Q_t(s,a) = \min_{P \in \mathcal{P}} \left[ r(s) + \gamma \sum_{s'} P(s' \mid s,a) \, V_{t-1}(s') \right]$
$V_t(s) = \max_a Q_t(s,a)$
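The robust backup can be sketched as follows, under the simplifying assumption that the uncertainty set $\mathcal{P}$ is a finite list of candidate transition models and that the adversary may pick a different model for each state-action pair. The cited papers work with richer uncertainty sets; the models and rewards below are hypothetical.

```python
def robust_value_iteration(models, reward, gamma, horizon):
    """Finite-horizon maximin backups: min over models inside, max over actions outside.

    models[k][s][a] is a list of next-state probabilities under candidate model k;
    reward[s] is the state reward r(s). Both layouts are assumptions of this sketch.
    """
    n_states = len(reward)
    n_actions = len(models[0][0])
    V = [0.0] * n_states
    policy = [0] * n_states
    for _ in range(horizon):
        Q = [[min(reward[s] + gamma * sum(p * v for p, v in zip(P[s][a], V))
                  for P in models)
              for a in range(n_actions)]
             for s in range(n_states)]
        V = [max(q) for q in Q]
        policy = [max(range(n_actions), key=q.__getitem__) for q in Q]
    return V, policy


# Hypothetical problem: state 1 is rewarding and absorbing. From state 0, action 0
# ("safe") reaches state 1 with probability 0.5 under both models, while action 1
# ("risky") reaches it with probability 0.9 under model A but only 0.1 under model B.
REWARD = [0.0, 1.0]
MODEL_A = [[[0.5, 0.5], [0.1, 0.9]], [[0.0, 1.0], [0.0, 1.0]]]
MODEL_B = [[[0.5, 0.5], [0.9, 0.1]], [[0.0, 1.0], [0.0, 1.0]]]

ROBUST_V, ROBUST_POLICY = robust_value_iteration([MODEL_A, MODEL_B], REWARD, 0.9, 50)
NOMINAL_V, NOMINAL_POLICY = robust_value_iteration([MODEL_A], REWARD, 0.9, 50)
```

Planning against model A alone prefers the risky action in state 0, whereas the maximin policy over both models prefers the safe action.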
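20 Model Uncertainty: Robust MDPs [McMahan, Gordon & Blum 2005]
Unknown model parameters: rewards
Decision criterion: maximin
Use a linear programming approach with constraint generation:
$\pi^* = \arg\max_{\pi} \min_{R \in \mathcal{R}} E^{\pi}_{x}\left[ \sum_t \gamma^t R(x_t) \right]$
$\max_{\delta, \pi} \; \delta \quad \text{subject to} \quad \delta \le V^{\pi}_{R} \quad \forall R \in \mathcal{R}$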
21 Model Uncertainty: Robust MDPs [Delage & Mannor 2007]
Unknown model parameters: transitions and rewards
Decision criterion: percentile criterion
- For rewards with Gaussian uncertainty, the problem is solved as a second-order cone program (SOCP)
- For transitions with Dirichlet uncertainty, an approximation is given
$\max_{\pi, y} \; y \quad \text{subject to} \quad \Pr\left( E\left[ \sum_{t=0}^{\infty} \gamma^t r_t(x_t) \mid \pi \right] \ge y \right) \ge \eta$
22 Model Uncertainty: Other Approaches
Reinforcement Learning [KS02, BT03] and Bayesian RL [D02, P06]
Assumes the transition (and reward) functions are not known beforehand. In Bayesian RL we have a prior over transitions and rewards. This is an online model.
Inverse Reinforcement Learning [NR00]
Assumes the reward function is unknown, but that we have examples of the optimal policy being executed. The aim is to find the best reward function.
24 Elicitation: Policy Teaching & Bayesian RL
Reinforcement Learning [KS02, BT03] and Bayesian RL [D02, P06]
Actions yield observations of the transition and reward functions. Actions are chosen to balance the explore/exploit tradeoff. This learning happens online.
Policy Teaching [ZP08, ZPC09]
A hidden reward function is learned by adding incentives to it and observing the resulting behaviour.
25 Elicitation: Robust MDPs
Reference      | Uncertainty | Measure    | Elicitation?
[B01,I05,NG05] | Transitions | Maximin    |
[MGB05]        | Reward      | Maximin    |
[DM07]         | Reward      | Percentile | Approximates myopic EVOI; uses equivalence queries: What is r(s,a)?
26 Elicitation: Robust MDPs
Reference      | Uncertainty | Measure    | Elicitation?
[B01,I05,NG05] | Transitions | Maximin    |
[MGB05]        | Reward      | Maximin    |
[DM07]         | Reward      | Percentile |
??
28 Directions: Summary
Our goal is to efficiently elicit reward functions for Markov decision problems. To reach this goal we must focus on:
1. Developing effective methods for computing good (robust) policies given reward uncertainty
2. Developing reward queries that are conceptually tractable and computationally efficient
3. Developing strategies to select queries to quickly produce better policies
[Figure: elicitation loop with boxes MDP, Reward, Compute Decision, Satisfied? (yes: Done; no: Select Query) and User; arrows labeled decision, measure, query and response.]
29 Directions: Computing Robust Policies
The minimax regret criterion can be applied to computing policies:
$\arg\min_{\pi} \max_{\pi'} \max_{r \in R} \left( V^{\pi'}_r - V^{\pi}_r \right)$
It offers a number of desirable properties:
1. Offers a (non-probabilistic) guarantee
2. Less conservative than maximin
3. The relative comparison between the current choice and the best possible choice offers an intuitive measure
Ongoing work has developed several novel approaches to computing minimax regret for Markov decision processes.
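To make the criterion concrete, the brute-force sketch below enumerates deterministic policies of a tiny MDP and evaluates each one under a finite list of candidate reward functions. Both restrictions are assumptions of this example (the reward set is generally continuous and the minimax-regret policy may be stochastic), and this is not the MIP-based method referred to in the slides. The MDP and rewards are hypothetical.

```python
from itertools import product


def policy_value(policy, transitions, reward, gamma, start, sweeps=500):
    """Expected discounted value of a deterministic policy from the start distribution."""
    n_states = len(transitions)
    V = [0.0] * n_states
    for _ in range(sweeps):  # iterative policy evaluation
        V = [reward[s][policy[s]]
             + gamma * sum(p * V[s2] for s2, p in enumerate(transitions[s][policy[s]]))
             for s in range(n_states)]
    return sum(w * v for w, v in zip(start, V))


def minimax_regret_policy(transitions, rewards, gamma, start):
    """Deterministic policy minimising max_{pi'} max_r (V^{pi'}_r - V^{pi}_r)."""
    n_states, n_actions = len(transitions), len(transitions[0])
    policies = list(product(range(n_actions), repeat=n_states))
    values = {(pi, k): policy_value(pi, transitions, r, gamma, start)
              for pi in policies for k, r in enumerate(rewards)}
    best = [max(values[pi, k] for pi in policies) for k in range(len(rewards))]

    def max_regret(pi):
        return max(best[k] - values[pi, k] for k in range(len(rewards)))

    choice = min(policies, key=max_regret)
    return choice, max_regret(choice)


# Hypothetical MDP: in state 0, action 0 stays and action 1 moves to absorbing state 1.
TRANSITIONS = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
# Two candidate rewards r[s][a]; they disagree only about the reward in state 1.
REWARD_A = [[0.5, 0.0], [0.8, 0.8]]
REWARD_B = [[0.5, 0.0], [2.0, 2.0]]

MMR_POLICY, MMR_REGRET = minimax_regret_policy(
    TRANSITIONS, [REWARD_A, REWARD_B], gamma=0.5, start=[1.0, 0.0])
```

Here staying in state 0 guarantees a value of 1.0 but risks a regret of 1.0, while moving to state 1 has a worst-case value of 0.8 and a maximum regret of only 0.2, which illustrates why minimax regret is less conservative than maximin.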
30 Directions: Computing Robust Policies
Ongoing work has developed several novel approaches to computing minimax regret for Markov decision processes:
- Exact formulations using linear and mixed integer programming with constraint generation [NIPS 08]
- Precomputation of non-dominated policies
- Factored MDPs
- Approximations [UAI 09]
31 Directions: Reward Queries
With respect to individual reward points, we can use many of the query types developed for single-step decision making:
- Bounding: is $r(s,a) \ge b$?
- Comparison: is $r(s,a) \ge r(s',a')$?
There is potential for queries which are sequential in nature:
- Policy: is $V^{\pi} \ge V^{\pi'}$?
- Trajectory: is $s_1, a_1, \ldots, s_{k-1}, a_{k-1}, s_k \succeq s_1', a_1', \ldots, s_{k-1}', a_{k-1}', s_k'$?
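For experiments, a simulated user with a known "true" reward can answer these queries. The sketch below is one hypothetical way to do that; in particular, scoring a trajectory by the discounted sum of its state-action rewards (ignoring the final state) is an assumption of this example, not something the slides specify. The state and action names are made up.

```python
def answer_bound(reward, point, b):
    """Bounding query: is r(s,a) >= b?"""
    return reward[point] >= b


def answer_comparison(reward, point, other):
    """Comparison query: is r(s,a) >= r(s',a')?"""
    return reward[point] >= reward[other]


def trajectory_return(reward, trajectory, gamma):
    """Discounted reward of a trajectory given as a list of (state, action) pairs."""
    return sum(gamma ** t * reward[step] for t, step in enumerate(trajectory))


def answer_trajectory(reward, first, second, gamma):
    """Trajectory query: is the first trajectory at least as good as the second?"""
    return (trajectory_return(reward, first, gamma)
            >= trajectory_return(reward, second, gamma))


# Hypothetical true reward over (state, action) points.
TRUE_REWARD = {
    ("street", "wait"): 0.0,
    ("street", "cab"): -1.0,
    ("office", "work"): 2.0,
}
WAIT_FIRST = [("street", "wait"), ("street", "wait"), ("office", "work")]
CAB_FIRST = [("street", "cab"), ("office", "work"), ("office", "work")]

PREFERS_CAB = answer_trajectory(TRUE_REWARD, CAB_FIRST, WAIT_FIRST, gamma=0.9)
```

A trajectory query of this kind lets the user compare whole courses of action instead of committing to numerical values for individual reward points.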
32 Directions: Summary
Computing Minimax Regret
- Exact methods using MIP + constraint generation [UAI 09, NIPS 08]
- Exact methods using non-dominated policies [In progress]
- Approximations [In progress]
Reward Queries
- Bound queries [UAI 09, NIPS 08]
- Richer (sequential) queries [Future work]
Query Selection
- Volumetric [Tech Report]
- Regret based [UAI 09]
33 Thank you
34 Conclusion: Future Work
Richer queries: Do you prefer the tradeoff
$f(s_2,a_3) = f_1$ amount of time doing $(s_2,a_3)$ and $f(s_1,a_4) = f_2$ amount of time doing $(s_1,a_4)$
or
$f(s_2,a_3) = f_1'$ amount of time doing $(s_2,a_3)$ and $f(s_1,a_4) = f_2'$ amount of time doing $(s_1,a_4)$?
[Figure: the two tradeoffs side by side; $f_1$ and $f_1'$ are labeled s = No Street Car, a = Waiting, and $f_2$ and $f_2'$ are labeled s = Cab Available, a = Take Cab.]
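35 Appendix: Full Formulation
Master:
$\min_{f,\delta} \; \delta$  (8)
subject to:
$r \cdot g - r \cdot f \le \delta \quad \forall g \in F,\ r \in R$
$\gamma E^{\top} f + \alpha = 0$
Subproblem:
$\max_{Q,V,I,r} \; \alpha \cdot V - r \cdot f$  (9)
subject to:
$Q_a = r_a + \gamma P_a V \quad \forall a \in A$
$V \ge Q_a \quad \forall a \in A$  (10)
$V \le (1 - I_a) M_a + Q_a \quad \forall a \in A$  (11)
$C r \le d$
$\sum_a I_a = 1$  (12)
$I_a(s) \in \{0, 1\} \quad \forall a, s$  (13)
$M_a = M^{\top} - M^{\perp}_a$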
36 Computation: Approximating Minimax Regret
We relax the max regret MIP formulation. The value of the resulting policy is no longer exact; however, the resulting reward is still feasible. We then find the optimal policy with respect to the resulting reward.
37 Computation: Scaling
[Figure: scaling plot (log scale).]
38 Evaluation: Experimental Setup
Randomly generated MDPs:
- 10 states, 5 actions
- Semi-sparse random transition function, discount factor of 0.95
- Random true reward drawn from a fixed interval; upper and lower bounds on reward drawn randomly
- All results are averaged over 20 runs
39 Evaluation: Elicitation Effectiveness
We examine the combination of each criterion for robust policies with each of the elicitation strategies.
Criteria:
- Minimax Regret (MMR)
- Maximin Regret (MR)
Elicitation strategies:
- Halve the Largest Gap (HLG)
- Current Solution (CS)
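The slide names the Halve the Largest Gap strategy without defining it. A plausible reading, assumed here, is that it asks a bound query at the midpoint of the reward point whose current interval is widest. The sketch below simulates that loop against a hypothetical true reward; the reward points and intervals are made up.

```python
def halve_largest_gap(bounds):
    """Pick the reward point with the widest interval and the midpoint to ask about."""
    point = max(bounds, key=lambda k: bounds[k][1] - bounds[k][0])
    low, high = bounds[point]
    return point, (low + high) / 2.0


def elicit(bounds, true_reward, n_queries):
    """Run n_queries bound queries ("is r(s,a) >= b?") against a simulated user."""
    bounds = dict(bounds)  # leave the caller's bounds untouched
    for _ in range(n_queries):
        point, b = halve_largest_gap(bounds)
        low, high = bounds[point]
        if true_reward[point] >= b:   # simulated "yes" answer raises the lower bound
            bounds[point] = (b, high)
        else:                         # simulated "no" answer lowers the upper bound
            bounds[point] = (low, b)
    return bounds


# Hypothetical initial reward intervals and true rewards for two (state, action) points.
INITIAL_BOUNDS = {("s0", "a0"): (0.0, 1.0), ("s0", "a1"): (0.0, 4.0)}
TRUE_REWARD = {("s0", "a0"): 0.3, ("s0", "a1"): 3.1}

FINAL_BOUNDS = elicit(INITIAL_BOUNDS, TRUE_REWARD, n_queries=3)
```

Each query halves one interval, so the first two queries go to the wider point and only the third reaches the narrower one.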
40 Evaluation: Max Regret - Random MDP
[Figure: max regret.]
41 Evaluation: True Regret (Loss) - Random MDP
[Figure: true regret.]
42 Evaluation: Maximin Value - Random MDP
[Figure: maximin value.]
43 Evaluation: Queries per Reward Point - Random MDP
- Most of the reward space is unexplored
- We repeatedly query a small set of high-impact reward points
44 Evaluation: Autonomic Computing
Setup: 2 hosts, 3 demand levels, 3 units of resource
Model: 90 states, 10 actions
[Figure: hosts 1 to k, each with its own demand and resource, drawing on a total resource.]
45 Evaluation: Max Regret - Autonomic Computing
[Figure: queries vs. max regret; series: Maximin, Minimax Regret.]
46 Evaluation: True Regret (Loss) - Autonomic Computing
[Figure: queries vs. true regret; series: Maximin, Minimax Regret.]
More informationLecture notes for Analysis of Algorithms : Markov decision processes
Lecture notes for Analysis of Algorithms : Markov decision processes Lecturer: Thomas Dueholm Hansen June 6, 013 Abstract We give an introduction to infinite-horizon Markov decision processes (MDPs) with
More informationLecture 3: The Reinforcement Learning Problem
Lecture 3: The Reinforcement Learning Problem Objectives of this lecture: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which
More informationPrioritized Sweeping Converges to the Optimal Value Function
Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science
More information, and rewards and transition matrices as shown below:
CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount
More informationRobust Modified Policy Iteration
Robust Modified Policy Iteration David L. Kaufman Department of Industrial and Operations Engineering, University of Michigan 1205 Beal Avenue, Ann Arbor, MI 48109, USA 8davidlk8umich.edu (remove 8s) Andrew
More informationProbabilistic Planning. George Konidaris
Probabilistic Planning George Konidaris gdk@cs.brown.edu Fall 2017 The Planning Problem Finding a sequence of actions to achieve some goal. Plans It s great when a plan just works but the world doesn t
More informationAn Introduction to Markov Decision Processes. MDP Tutorial - 1
An Introduction to Markov Decision Processes Bob Givan Purdue University Ron Parr Duke University MDP Tutorial - 1 Outline Markov Decision Processes defined (Bob) Objective functions Policies Finding Optimal
More informationMulti-model Markov Decision Processes
Multi-model Markov Decision Processes Lauren N. Steimle Department of Industrial and Operations Engineering, University of Michigan, Ann Arbor, MI 48109, steimle@umich.edu David L. Kaufman Management Studies,
More informationCSE 573. Markov Decision Processes: Heuristic Search & Real-Time Dynamic Programming. Slides adapted from Andrey Kolobov and Mausam
CSE 573 Markov Decision Processes: Heuristic Search & Real-Time Dynamic Programming Slides adapted from Andrey Kolobov and Mausam 1 Stochastic Shortest-Path MDPs: Motivation Assume the agent pays cost
More informationUniversity of Alberta
University of Alberta NEW REPRESENTATIONS AND APPROXIMATIONS FOR SEQUENTIAL DECISION MAKING UNDER UNCERTAINTY by Tao Wang A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment
More informationA Gentle Introduction to Reinforcement Learning
A Gentle Introduction to Reinforcement Learning Alexander Jung 2018 1 Introduction and Motivation Consider the cleaning robot Rumba which has to clean the office room B329. In order to keep things simple,
More informationChapter 3: The Reinforcement Learning Problem
Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which
More informationAn Analysis of Model-Based Interval Estimation for Markov Decision Processes
An Analysis of Model-Based Interval Estimation for Markov Decision Processes Alexander L. Strehl, Michael L. Littman astrehl@gmail.com, mlittman@cs.rutgers.edu Computer Science Dept. Rutgers University
More informationLearning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods
Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods Yaakov Engel Joint work with Peter Szabo and Dmitry Volkinshtein (ex. Technion) Why use GPs in RL? A Bayesian approach
More informationMarkov Decision Processes (and a small amount of reinforcement learning)
Markov Decision Processes (and a small amount of reinforcement learning) Slides adapted from: Brian Williams, MIT Manuela Veloso, Andrew Moore, Reid Simmons, & Tom Mitchell, CMU Nicholas Roy 16.4/13 Session
More informationAn Adaptive Clustering Method for Model-free Reinforcement Learning
An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at
More informationDeep Reinforcement Learning: Policy Gradients and Q-Learning
Deep Reinforcement Learning: Policy Gradients and Q-Learning John Schulman Bay Area Deep Learning School September 24, 2016 Introduction and Overview Aim of This Talk What is deep RL, and should I use
More informationPartially Observable Markov Decision Processes (POMDPs) Pieter Abbeel UC Berkeley EECS
Partially Observable Markov Decision Processes (POMDPs) Pieter Abbeel UC Berkeley EECS Many slides adapted from Jur van den Berg Outline POMDPs Separation Principle / Certainty Equivalence Locally Optimal
More information