Solving MDPs with Continuous Time (Janusz Marecki, Zvi Topol, Milind Tambe)
Transcription
1 Welcome
3–5 Janusz Marecki, Zvi Topol, Milind Tambe
7 Solving MDPs with Continuous Time
8 Why do I care about continuous time?
9–12 At the airport: the flight leaves at 12:00 and you start at 10:15 on a 30 min trip, but with uncertain durations the arrival may be 10:45, 10:46, 10:47, ..., 10:51
14 Action durations = Uncertainty
15 Challenging planning problems
16 Existing work = Numerical solutions
17 This work = Analytical solutions
18 Huge speedups
19 Outline
20 Domain, Model, CPH Solver, Results, Summary
21–29 Mars rover exploration: a lander at the landing site, a base, sites of interest Site1, Site2, Site3, and the rover's location
30–34 Actions: move to the next site; return to base
35–38 Action outcomes are uncertain: an action taken in state A may end in state B or in state C
39–41 Action durations are uncertain: how long the move from state A to state B takes is not known in advance
42 Rewards
43 Explore a site; return to base
44 Rewards are received upon action completion
45–47 Finally, there is a deadline
48 Domain, Model, CPH Solver, Results, Summary
49–51 Action duration distributions p(t): deterministic, stochastic and discrete, stochastic and continuous
52 Deadlines
53–62 Which model fits which action durations?

    Action durations         No deadline   Deadline
    Deterministic            MDP           MDP
    Stochastic, discrete     MDP           Time-dependent MDP
    Stochastic, continuous   Semi-MDP      Time-dependent MDP

    Deterministic durations are unrealistic; discretizing continuous durations gives no quality guarantees, and the number of states blows up.
63–65 Stochastic continuous durations + a deadline = a difficult problem
66 Why?
67 The policy depends on the state s and on the time-to-deadline t
68–71 V(s)(t) is the policy value at (s, t); V(s) is a function over t
72 How to find V(s)?
73 Bellman update
74–75 Suppose s′ precedes s: the action taken in s′ leads to s
76 We assume V(s) is known
77–78 We derive V(s′)
79 Action duration density p(t)
80 Q: How to derive V(s′)(t)? A: Convolution
82–84 In s′ the time-to-deadline is t; the action may consume t′ with density p(t′), leaving time-to-deadline t − t′ in s, worth V(s)(t − t′)
85–87 Integrating over all durations: V(s′)(t) = ∫₀ᵗ p(t′) V(s)(t − t′) dt′
88 In short: V(s′) = p * V(s)
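To make the backup concrete, here is a minimal numerical sketch of V(s′) = p * V(s) on a uniform time grid. The grid step, λ = 1, and the 4-unit deadline are illustrative choices (the last two happen to match the example domain at the end), not part of the talk's method:

```python
import numpy as np

# Minimal numerical sketch of V(s')(t) = int_0^t p(t') V(s)(t - t') dt'.
# DT (grid step), LAM and T are illustrative assumptions.
DT, T, LAM = 0.01, 4.0, 1.0
ts = np.arange(0.0, T + DT, DT)          # grid over time-to-deadline
p = LAM * np.exp(-LAM * ts)              # exponential duration density p(t')

def backup(v_succ):
    """Convolve p with the successor's value function by the trapezoid rule."""
    v_pred = np.zeros_like(v_succ)
    for i in range(1, len(ts)):
        f = p[: i + 1] * v_succ[i::-1]   # p(t') * V(s)(t - t') for t' in [0, t]
        v_pred[i] = (f.sum() - 0.5 * (f[0] + f[-1])) * DT
    return v_pred

# Check: if V(s) = 1 everywhere, then V(s')(t) is the probability that the
# action finishes in time, i.e. 1 - e^{-LAM t}.
v_sp = backup(np.ones_like(ts))
print(v_sp[-1], 1.0 - np.exp(-LAM * T))  # ~0.9817 for both
```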
89 Computing convolutions: numerical methods, approximation, error guarantees
90 Examples
91 Outcome of the convolution p(t) * V(s)(t)
92–93 Numerical methods approximate p(t) by discrete, piecewise constant, piecewise linear, ... densities; richer classes approximate better
95–96 But a better approximation of p(t) and V(s) enriches the function class of p * V(s)
97–103 The class of p * V(s), given the classes of p(t) (rows) and V(s) (columns):

    p(t) \ V(s)   Discrete    Constant    Linear      Quadratic
    Discrete      Discrete    Constant    Linear      Quadratic
    Constant      Constant    Linear      Quadratic   Cubic
    Linear        Linear      Quadratic   Cubic       Quartic
    Quadratic     Quadratic   Cubic       Quartic     Quintic
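A quick symbolic check of the degree growth in this table; sympy and the particular single-piece examples are my choices for illustration, not the talk's:

```python
import sympy as sp

# Convolving single polynomial pieces: each convolution climbs one class,
# matching the table above.
t, u = sp.symbols("t u", nonnegative=True)

def conv(p_piece, v_piece):
    """(p * v)(t) = integral_0^t p(u) v(t - u) du, for one piece of each."""
    return sp.expand(sp.integrate(p_piece.subs(t, u) * v_piece.subs(t, t - u), (u, 0, t)))

print(conv(sp.Integer(1), sp.Integer(1)))  # constant * constant -> t      (linear)
print(conv(sp.Integer(1), t))              # constant * linear   -> t**2/2 (quadratic)
print(conv(t, t))                          # linear * linear     -> t**3/6 (cubic)
```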
104–105 Better approximation = intractability (representation & dominance)
106 Existing work: discrete p(t), or repeated approximation
107–108 Discrete p(t) [Boyan 02]
109–110 With discrete p(t), V(s) and p * V(s) both stay (piecewise) linear in the table above
111–112 Repeated approximation: Lazy Approximation [Li 05]
113 A big improvement over discrete p(t)
114–116 V(s) is kept piecewise constant, so p * V(s) comes out piecewise linear and is approximated back down to piecewise constant
117 The fastest algorithm with quality guarantees
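The re-approximation step can be pictured as follows. This is a sketch in the flavor of Lazy Approximation, not Li 05's exact procedure; the function name and the segmentation rule are my own:

```python
def linear_to_constants(a, b, lo, hi, eps):
    """Approximate v(t) = a + b*t on [lo, hi] by constants, sup-error <= eps.

    Taking the midpoint value on a segment of length L errs by at most
    |b| * L / 2, so segments of length 2*eps/|b| suffice.
    """
    seg = 2.0 * eps / abs(b) if b else hi - lo
    out, left = [], lo
    while left < hi:
        right = min(left + seg, hi)
        out.append((left, right, a + b * (left + right) / 2.0))
        left = right
    return out

print(linear_to_constants(0.0, 1.0, 0.0, 4.0, eps=0.25))  # eight half-unit steps
```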
118 Better approximation = intractability?
119 Not necessarily: CPH gets around this tradeoff by using a completely different solution technique, going in a completely different direction
120 Domain, Model, CPH Solver, Results, Summary
121 Key ideas: (1) phase-type approximation of p(t); (2) analytical convolution of p * V(s)
122 1 Phase-Type approximation of p(t)
123–124 Approximate the MDP M by an MDP M′ whose action durations are all exponential: p(t) = λe^{−λt}
125–126 Suppose a transition in M from s1 to s2 has duration density p(t)
127 What if p(t) ≠ λe^{−λt}?
128 Example
129 Example: a normal duration distribution p(t) with mean = 2 and variance = 1
130–132 is approximated by p(t) = 1.37 e^{−1.37t}
133 What is the new transition time from s1 to s2?
134–136 Comparison of the original p(t) with its approximation
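One simple way to pick such an approximation is to moment-match an Erlang chain of identical exponential phases. This is a sketch of the idea only; it may differ from the fitting procedure behind the 1.37 above:

```python
import math

def erlang_fit(mean, var):
    """Match mean and variance of Erlang(n, lam): mean = n/lam, var = n/lam**2."""
    n = max(1, round(mean * mean / var))  # n = mean^2 / var
    lam = n / mean
    return n, lam

n, lam = erlang_fit(mean=2.0, var=1.0)    # the Normal(2, 1) example above
print(n, lam)                             # 4 phases, each with rate 2.0
print(n / lam, n / lam ** 2)              # recovered mean 2.0 and variance 1.0
```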
137 Phase-type approximation: more phases = better approximation; introduce self-transitions. But what planning horizon is needed?
138 Planning horizon n*: with horizon n* the policy is less than ɛ away from optimal, and we have found n*
139 Proof in the paper
140 Rmax = maximum action reward; Δ = time to deadline
141 n* = log_{(e^{λΔ}−1)/e^{λΔ}} ( ɛ / (Rmax (e^{λΔ} − 1)) )
142 1 Phase-Type approximation of p(t)
143 2 Analytical convolution of p * V(s)
144 Fast convolutions!
145 Action durations: p(t) = λe^{−λt}
146 We prove two things
147 First
148–149 V(s) is piecewise over time-to-deadline t, with breakpoints t0, t1, t2, ... and pieces V1(s), V2(s), V3(s), ...
150 Each piece Vi(s) is a gamma function
151 Vi(s)(t) = c_{s,i,1} + e^{−λt} ( c_{s,i,2} + c_{s,i,3}(λt) + ... + c_{s,i,n+1} (λt)^{n−1}/(n−1)! ), stored as the vector [c_{s,i,1}, c_{s,i,2}, c_{s,i,3}, ..., c_{s,i,n+1}]
152 V(s) = piecewise gamma:
    t0 : [c_{s,0,1}, c_{s,0,2}, ..., c_{s,0,n+1}]
    t1 : [c_{s,1,1}, c_{s,1,2}, ..., c_{s,1,n+1}]
    ...
    tm : [c_{s,m,1}, c_{s,m,2}, ..., c_{s,m,n+1}]
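A sketch of evaluating this representation, assuming the reconstructed form above (breakpoints paired with coefficient vectors; the names and data layout are mine):

```python
import math

def eval_piece(coeffs, lam, t):
    """c1 + e^{-lam t} (c2 + c3 (lam t) + ... + c_{n+1} (lam t)^{n-1}/(n-1)!)."""
    c1, tail = coeffs[0], coeffs[1:]
    s = sum(c * (lam * t) ** k / math.factorial(k) for k, c in enumerate(tail))
    return c1 + math.exp(-lam * t) * s

def eval_value(pieces, lam, t):
    """pieces = [(t0, coeffs0), (t1, coeffs1), ...] with t0 < t1 < ..."""
    for bp, coeffs in reversed(pieces):   # find the piece whose breakpoint <= t
        if t >= bp:
            return eval_piece(coeffs, lam, t)

v_s = [(0.0, [1.0, -1.0])]                # single piece: V(s)(t) = 1 - e^{-lam t}
print(eval_value(v_s, 1.0, 4.0))          # ~0.9817
```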
153 Second
154 V(s′) = p * V(s) is derived analytically, using simple vector operations
155 Each coefficient vector [c_{s,i,1}, ..., c_{s,i,n+1}] of V(s) maps directly to the corresponding coefficient vector of V(s′)
156 Proof in the paper
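Within a single piece the vector operation can be made explicit. Convolving the exponential p with the basis terms gives p * 1 = 1 − e^{−λt} and p * (e^{−λt}(λt)^k/k!) = e^{−λt}(λt)^{k+1}/(k+1)!, so the backup shifts the tail of the coefficient vector and inserts −c1. A sketch under that derivation (single-piece case only; breakpoint handling is omitted):

```python
import math

def convolve_exp(coeffs):
    """p * V for one piece: [c1, c2, ..., c_{n+1}] -> [c1, -c1, c2, ..., c_{n+1}]."""
    return [coeffs[0], -coeffs[0]] + list(coeffs[1:])

# V(s)(t) = 1 - e^{-lam t}  ->  V(s')(t) = 1 - e^{-lam t} - (lam t) e^{-lam t}
v_sp = convolve_exp([1.0, -1.0])
lam, t = 1.0, 4.0
val = v_sp[0] + math.exp(-lam * t) * sum(
    c * (lam * t) ** k / math.factorial(k) for k, c in enumerate(v_sp[1:]))
print(v_sp, val)                          # [1.0, -1.0, -1.0], ~0.9084
```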
157 Algorithm
158 Significant speedups
159 Domain, Model, CPH Solver, Results, Summary
160 Experiment 1: correctness of CPH
161 Experiment 2: exponential action durations
162 Experiment 3: Weibull action durations
163 Experiment 4: normal action durations
164 Speedups over all distributions
165 Domain, Model, CPH Solver, Results, Summary
166 Summary: continuous time is an important problem; phase-type approximation; analytical solution; error guarantees; speedups
167 Future work
168 Thank You!
170 Domain parameters: state-to-state transitions are deterministic; action durations have density p(t) = e^{−t}; the time-to-deadline is 4 time units; the rewards are 6 for returning to base and 4, 2, 1 for scanning Site1, Site2, Site3 respectively
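As a worked check on these parameters: with p(t) = e^{−t}, finishing k actions in a row takes Erlang(k, 1) time, so the chance of finishing within the 4-unit deadline has a closed form (a side computation of mine, not a slide):

```python
import math

def p_finish(k, lam=1.0, deadline=4.0):
    """P(k exponential actions complete by the deadline) = Erlang(k, lam) CDF."""
    x = lam * deadline
    return 1.0 - math.exp(-x) * sum(x ** j / math.factorial(j) for j in range(k))

for k in range(1, 5):
    print(k, round(p_finish(k), 3))   # 0.982, 0.908, 0.762, 0.567
```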
Paper: A Fast Analytical Algorithm for MDPs with Continuous State Spaces. Janusz Marecki, Zvi Topol and Milind Tambe, Computer Science Department, University of Southern California, Los Angeles, CA.
More information