A Fast Analytical Algorithm for MDPs with Continuous State Spaces
Janusz Marecki, Zvi Topol, Milind Tambe (University of Southern California)


Transcription

1 Welcome

2

3 Janusz Marecki

4 Janusz Marecki, Zvi Topol

5 Janusz Marecki, Zvi Topol, Milind Tambe

6

7 Solving MDPs with Continuous Time

8 Why do I care about continuous time?

9 30 min

10 At the airport

11 Timeline: Start 10:15, 10:45, 12:00

12 Timeline: 10:45, 10:46, 10:47, 10:48, 10:49, 10:50, 10:51

13

14 Action durations = Uncertainty

15 Challenging planning problems

16 Existing work = Numerical solutions

17 This work = Analytical solutions

18 Huge speedups

19 Outline

20 Domain, Model, CPH Solver, Results, Summary

21 Mars rover exploration

22 Mars

23 Landing site

24 Sites of interest

25 Lander

26 Base (the lander)

27 Exploration sites

28 Exploration sites: Site1, Site2, Site3

29 Rover location: Base; sites Site1, Site2, Site3

30 Actions

31 Move to next Site

32 Move to next Site

33 Return to Base

34 Return to Base

35 Action outcomes = uncertain

36 State A → State B, State C

37 State A → State B, State C

38 State A → State B? State C?

39 Action durations = uncertain

40 State A → State B

41 State A → State B?

42 Rewards

43 Explore Site; Return to Base

44 Achieved upon action completion

45 Finally

46

47 Deadline

48 Domain, Model, CPH Solver, Results, Summary

49 Action duration p(t): Deterministic / Stochastic and Discrete / Stochastic and Continuous

50 Action duration p(t): Deterministic / Stochastic and Discrete / Stochastic and Continuous

51 Action duration p(t): Deterministic / Stochastic and Discrete / Stochastic and Continuous

52 Deadlines

53 Action durations: Deterministic | Stochastic, discrete | Stochastic, continuous
Deadline: MDP | Time-Dependent MDP | ?
No deadline: MDP | MDP | Semi-MDP

54 (the same table)

55 (the same table)

56 Unrealistic

57

58 Action durations: Deterministic | Stochastic, discrete | Stochastic, continuous
Deadline: MDP | Time-Dependent MDP | ?
No deadline: MDP | MDP | Semi-MDP

59 No quality guarantees

60 Number of states blows up

61

62 Action durations: Deterministic | Stochastic, discrete | Stochastic, continuous
Deadline: MDP | Time-Dependent MDP | Time-Dependent MDP
No deadline: MDP | MDP | Semi-MDP

63 Deadline + Stochastic continuous

64 Deadline + Stochastic continuous

65 Stochastic continuous + Deadline = Difficult problem

66 Why?

67 Policy depends on: state s, time-to-deadline t

68 Policy value at s,t

69 V(s)(t): policy value at (s, t)

70 V(s)(t): policy value at (s, t); V(s) is a function over t

71 Plot: V(s)(t) against t

72 How to find V(s)?

73 Bellman update

74 Suppose s' precedes s

75 s' → s

76 We assume V(s) is known (plot: V(s)(t) against t)

77 We derive V(s') (plot: V(s')(t) against t)

78 We derive V(s'): V(s')(t) = ?

79 Action duration p(t)

80 Q: How to derive V(s')(t)? A: Convolution

81

82 In s', time-to-deadline = t

83 In s', time-to-deadline = t; the action may consume t', with density p(t')

84 In s', time-to-deadline = t; the action may consume t'; in s, time-to-deadline = t - t', worth V(s)(t - t')

85 p(t') · V(s)(t - t')

86 ∫_0^t p(t') · V(s)(t - t') dt'

87 Convolution: V(s')(t) = ∫_0^t p(t') · V(s)(t - t') dt'

88 Convolution: V(s') = p * V(s)

89 Computing convolutions: numerical methods; approximation; error guarantees
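To make the numerical route concrete, here is a minimal grid-discretization sketch of the convolution from slide 87. The density p, the successor value V(s), and the grid step are made-up illustrations, not the talk's algorithm:

```python
import numpy as np

# Sketch only: approximate V(s')(t) = ∫_0^t p(t') V(s)(t - t') dt' on a uniform grid.
dt = 0.001
t = np.arange(0.0, 4.0 + dt, dt)     # time-to-deadline axis, deadline = 4
p = np.exp(-t)                       # example duration density p(t) = e^(-t)
V_s = np.full_like(t, 10.0)          # example successor value: flat reward of 10

# Discretized convolution: V_sp[i] ≈ Σ_j p[j] · V_s[i - j] · dt
V_sp = np.convolve(p, V_s)[: len(t)] * dt
print(V_sp[-1])                      # close to the analytic 10·(1 - e^(-4)) ≈ 9.82
```

The grid step dt controls the approximation error, which is exactly the accuracy/cost tradeoff the next slides examine.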

90 Examples

91 Outcome of convolution: p(t) * V(s)(t) (plots)

92 Numerical methods: represent p(t) as Discrete, Constant, Linear, ... (plots)

93 Numerical methods, better approximation: Discrete, Constant, Linear, ... (plots)

94

95 Better approximation (along both axes of the table on the next slides)

96 Better approximation along both axes; entries: the function class of p*V(s)

97-103 Convolution closure table (built up across these slides); rows: class of p(t), columns: class of V(s), entries: class of p*V(s):

Discrete | Constant | Linear | Quadratic
Discrete: Discrete | Constant | Linear | Quadratic
Constant: Constant | Linear | Quadratic | Cubic
Linear: Linear | Quadratic | Cubic | Quartic
Quadratic: Quadratic | Cubic | Quartic | Quintic

104 Better approximation = Intractability

105 Better approximation = Intractability (representation & dominance)

106 Existing work: Discrete p(t); Repeated approximation

107 Discrete p(t)

108 Boyan 02

109 (the closure table again)

110 (closure table) V(s) and p*V(s): Linear

111 Repeated approximation

112 Lazy Approximation Li 05

113 Big improvement over Discrete p(t)

114 (the closure table again)

115 (closure table) V(s): Constant; p*V(s): Linear

116 (closure table) p*V(s) approximated back to Constant

117 Fastest algorithm with quality guarantees

118 Better approximation = Intractability?

119 Not necessarily: I get around this tradeoff; I use a completely different solution technique; I go in a completely different direction

120 Domain, Model, CPH Solver, Results, Summary

121 Key Ideas: 1. Phase-Type approximation of p(t); 2. Analytical convolution p * V(s)

122 1 Phase-Type approximation of p(t)

123 MDP M → (approximation) → MDP M'

124 MDP M → MDP M'; in M', action durations are p(t) = λe^(-λt)

125 Suppose a transition in M: s1 → s2

126 Suppose a transition in M: s1 → s2, with duration distribution p'(t)

127 What if p'(t) ≠ λe^(-λt)?

128 Example

129 Normal distribution p'(t): mean = 2, variance = 1

130

131

132 p(t) = 1.37·e^(-1.37t)

133 New transition time from s1 to s2?

134 Approximated p'(t)

135 Comparison: approximated p'(t) vs. p'(t)

136 Comparison of p'(t): approximated p'(t) vs. original p'(t) (plot)
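The comparison on slides 135-136 is easy to reproduce numerically. A quick check of the two cumulative distribution functions (scipy is my choice here, not something the talk uses):

```python
import numpy as np
from scipy import stats

# Original duration p'(t) = Normal(mean 2, variance 1) vs. the talk's
# single-phase fit p(t) = 1.37·e^(-1.37t).
original = stats.norm(loc=2, scale=1)     # variance 1, so std 1
fit = stats.expon(scale=1 / 1.37)         # exponential with rate λ = 1.37
for t in np.linspace(0, 5, 6):
    print(f"t = {t:.0f}:  P(duration <= t)  original = {original.cdf(t):.3f}  fit = {fit.cdf(t):.3f}")
```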

137 Phase-Type approximation: more phases = better approximation (introduce self-transitions); but what planning horizon?
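The transcript does not spell out the fitting procedure, so as one illustration of "more phases = better approximation", here is a moment-matching sketch with an Erlang chain of k identical exponential phases (my example, not the authors' method):

```python
import numpy as np

rng = np.random.default_rng(0)

def erlang_samples(mean, k, n=200_000):
    """Sum of k exponential phases, each with mean mean/k: an Erlang(k)
    approximation whose mean is `mean` and whose variance is mean**2 / k."""
    return rng.exponential(scale=mean / k, size=(n, k)).sum(axis=1)

# Target from the talk's example: a duration with mean 2 (Normal(2, 1) on slide 129).
for k in (1, 2, 4, 8):
    d = erlang_samples(mean=2.0, k=k)
    print(f"{k} phases: mean = {d.mean():.2f}, variance = {d.var():.2f}")
```

With the mean fixed at 2, the Erlang variance is 4/k, so k = 4 phases already match the variance of the Normal(2, 1) target.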

138 Planning horizon n*: the resulting policy is less than ε away from optimal; we have found n*

139 Proof in the paper

140 Rmax = maximum action reward; Δ = time to deadline

141 n ≥ log_{(e^λ - 1)/e^λ} [ ε / (Rmax·(e^λ - 1)) ]
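Reading the bound as reconstructed above (the slide's formula is garbled, so this reading is tentative and not checked against the paper), the horizon is a one-liner to compute. λ = 1.37 and Rmax = 6 come from the talk's examples; ε = 0.01 is an arbitrary choice:

```python
import math

def planning_horizon(lam, eps, r_max):
    # n ≥ log_{(e^λ - 1)/e^λ} [ ε / (Rmax·(e^λ - 1)) ], per the reconstruction above
    base = (math.exp(lam) - 1) / math.exp(lam)
    arg = eps / (r_max * (math.exp(lam) - 1))
    return math.ceil(math.log(arg) / math.log(base))

print(planning_horizon(lam=1.37, eps=0.01, r_max=6))
```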

142 1 Phase-Type approximation of p(t)

143 2 Analytical convolution of p * V(s)

144 Fast convolutions!

145 Action durations p(t) = λe^(-λt)

146 We proved 2 things

147 First

148 Plot: V(s)(t) against time-to-deadline t (breakpoint at t0)

149 V(s) is piecewise: V(s)(t) consists of pieces V1(s), V2(s), V3(s) with breakpoints t0, t1, t2

150 Each piece Vi(s) = Gamma function

151 Gamma function: Vi(s)(t) = c_{s,i,1} + e^(-λt)·( c_{s,i,2} + c_{s,i,3}·(λt) + ... + c_{s,i,n+1}·(λt)^(n-1)/(n-1)! ). Stored as the vector [c_{s,i,1}, c_{s,i,2}, c_{s,i,3}, ..., c_{s,i,n+1}]

152 V(s) = piecewise gamma:
t0: [c_{s,0,1}, c_{s,0,2}, c_{s,0,3}, ..., c_{s,0,n+1}]
t1: [c_{s,1,1}, c_{s,1,2}, c_{s,1,3}, ..., c_{s,1,n+1}]
...
tm: [c_{s,m,1}, c_{s,m,2}, c_{s,m,3}, ..., c_{s,m,n+1}]
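A small evaluator for one stored piece, assuming the gamma form reconstructed on slide 151 (the coefficient values below are made up):

```python
import math

def eval_gamma_piece(coeffs, lam, t):
    """Evaluate c1 + e^(-λt)·(c2 + c3·(λt) + ... + c_{n+1}·(λt)^(n-1)/(n-1)!)
    from the stored coefficient vector [c1, c2, ..., c_{n+1}]."""
    c1, rest = coeffs[0], coeffs[1:]
    poly = sum(c * (lam * t) ** k / math.factorial(k) for k, c in enumerate(rest))
    return c1 + math.exp(-lam * t) * poly

print(eval_gamma_piece([3.0, -3.0, 1.0, 2.0], lam=1.0, t=2.0))  # one made-up piece
```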

153 Second

154 V(s') = p * V(s): derived analytically, with simple vector operations

155 V(s'):
t0: [c_{s',0,1}, c_{s',0,2}, ..., c_{s',0,n+1}]
t1: [c_{s',1,1}, c_{s',1,2}, ..., c_{s',1,n+1}]
...
tm: [c_{s',m,1}, c_{s',m,2}, ..., c_{s',m,n+1}]
computed from V(s):
t0: [c_{s,0,1}, c_{s,0,2}, ..., c_{s,0,n+1}]
t1: [c_{s,1,1}, c_{s,1,2}, ..., c_{s,1,n+1}]
...
tm: [c_{s,m,1}, c_{s,m,2}, ..., c_{s,m,n+1}]

156 Proof in the paper
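The closure behind slides 154-155 can be spot-checked symbolically: convolving an exponential duration with one gamma-form piece yields another gamma-form piece. This sympy sketch uses made-up coefficients and verifies closure only; it does not reproduce the paper's coefficient-update formulas:

```python
import sympy as sp

t, u, lam = sp.symbols("t u lambda", positive=True)

# One gamma-form piece of V(s) with made-up coefficients: c1 + e^(-λt)·(c2 + c3·λt)
V = 3 + sp.exp(-lam * t) * (1 + 2 * lam * t)
p = lam * sp.exp(-lam * u)               # exponential action duration

# V(s')(t) = ∫_0^t p(u)·V(s)(t - u) du, computed symbolically
V_next = sp.integrate(p * V.subs(t, t - u), (u, 0, t))
print(sp.expand(V_next))                 # 3 + e^(-λt)·(-3 + λt + (λt)²): gamma form again
```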

157 Algorithm

158 Significant speedups

159 Domain, Model, CPH Solver, Results, Summary

160 Experiment 1: correctness of CPH

161 Experiment 2: action durations: Exponential

162 Experiment 3: action durations: Weibull

163 Experiment 4: action durations: Normal

164 Speedups over all distributions

165 Domain, Model, CPH Solver, Results, Summary

166 Summary: continuous time is an important problem; Phase-Type approximation; analytical solution; error guarantees; speedups

167 Future work

168 Thank You!

169

170 Domain parameters: state-to-state transitions are deterministic; action durations are p(t) = e^(-t); time-to-deadline equals 4 time units; rewards are 6 for returning to base and 4, 2, 1 for scanning Site1, Site2, Site3 respectively
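These parameters support a quick Monte-Carlo sanity check. The sketch below scores one fixed plan, scan Site1 then return to base, treating each leg as a single action with duration density e^(-t); the plan and the one-action-per-leg structure are my simplifications, not the talk's solution:

```python
import random

random.seed(0)
DEADLINE = 4.0   # time to deadline from the slide

def evaluate(leg_rewards, trials=100_000):
    """Estimate the expected reward of a fixed action sequence where each
    leg has an Exp(1) duration (p(t) = e^(-t)) and pays on completion."""
    total = 0.0
    for _ in range(trials):
        elapsed, reward = 0.0, 0.0
        for r in leg_rewards:
            elapsed += random.expovariate(1.0)   # sample the leg's duration
            if elapsed > DEADLINE:
                break                            # past the deadline: no reward
            reward += r
        total += reward
    return total / trials

# Scan Site1 (reward 4), then return to base (reward 6).
print(evaluate([4.0, 6.0]))
```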
