The Art of Sequential Optimization via Simulations
1 The Art of Sequential Optimization via Simulations
Stochastic Systems and Learning Laboratory, EE, CS* & ISE* Departments (*by courtesy), Viterbi School of Engineering, University of Southern California
(Based on joint work with Dileep Kalathil (UC Berkeley), W. Haskell (NU Singapore), V. Borkar (IITB), A. Gupta (Ohio State) & P. Glynn (Stanford))
Nov 19, 2015
2 Outline
I. Dynamic Programming
II. Empirical Dynamic Programming
III. Extensions
IV. Empirical Q-Value Iteration
3 Planning in Stochastic Environments
Example: States = 1200, Actions = 4. Large state spaces, small action spaces.
What is the strategy to achieve maximum expected reward?
Model: States, Actions, Transitions, Rewards.
What is the solution? A policy: a map from a state to an action.
4 Markov Decision Processes: A Formal Model
[Diagram: a trajectory x_0 -(a_0)-> x_1 -(a_1)-> x_2 -> ..., collecting rewards r(x_0, a_0), r(x_1, a_1), ...]
An MDP has:
- State space X, initial distribution λ
- Action space A
- State transition probability Q(y | x, a)
- Reward function r(x, a)
- Fixed, stationary policies π(a; x)
Value of a policy V^π; objective: sup_π V^π
5 Dynamic Programming
State value function of a policy π:
V^π(x) = E^π[ Σ_{t=0}^∞ γ^t r(x_t, a_t) ]
Optimal value function: V*(x) = sup_π V^π(x)
The dynamic programming equation (`Principle of Optimality', Bellman, 1959):
V*(x) = sup_a { r(x, a) + γ E[V*(y) | x, a] }
11 DP and the Bellman Operator
Let Ψ : X × A × Ω → X be a simulation model for the transition kernel Q, i.e., x_{t+1} = Ψ(x_t, a_t, ω_t) with ω_t i.i.d. noise.
The Bellman operator is
[T V](x) := sup_a { r(x, a) + γ E_ω[ V(Ψ(x, a, ω)) ] }
The DP equation is now a fixed-point equation: T V* = V*
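A simulation model Ψ can be realized by inverse-CDF sampling of the transition kernel. A minimal sketch, assuming an invented 2-state, 2-action kernel P (not from the talk):

```python
import numpy as np

# Toy transition kernel, invented for illustration: P[a, x, y] = Q(y | x, a).
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])

def psi(x, a, omega):
    """Simulation model Psi : X x A x [0,1) -> X via inverse-CDF sampling.

    With omega ~ Uniform[0,1), psi(x, a, omega) = y with probability P[a, x, y].
    """
    cdf = np.cumsum(P[a, x])
    y = int(np.searchsorted(cdf, omega, side="right"))
    return min(y, P.shape[-1] - 1)  # guard against float round-off near omega = 1

# Sanity check: empirical next-state frequencies should match the kernel row.
rng = np.random.default_rng(0)
samples = [psi(0, 0, rng.random()) for _ in range(100_000)]
freq = np.bincount(samples, minlength=2) / len(samples)
print(freq)  # close to P[0, 0] = [0.8, 0.2]
```

Any simulator with the correct conditional law works here; inverse-CDF sampling is just the simplest construction when the kernel is known explicitly.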
12 The Value Iteration Algorithm
The Bellman operator T is a contraction: ‖T V_1 − T V_2‖_∞ ≤ γ ‖V_1 − V_2‖_∞
Value Iteration:
V_{k+1}(x) = [T V_k](x) := sup_a { r(x, a) + γ E_ω[ V_k(Ψ(x, a, ω)) ] }
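Exact value iteration is just repeated application of T. A minimal sketch on an invented 2-state, 2-action MDP (the kernel P, rewards r, and discount γ are illustrative, not from the talk):

```python
import numpy as np

gamma = 0.9
# Toy MDP, invented for illustration: P[a, x, y] = Q(y | x, a), r[a, x] = r(x, a).
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])

def bellman(V):
    # [T V](x) = max_a { r(x, a) + gamma * E[V(y) | x, a] }
    return np.max(r + gamma * (P @ V), axis=0)

V = np.zeros(2)
for _ in range(1000):
    V_next = bellman(V)
    if np.max(np.abs(V_next - V)) < 1e-12:
        break
    V = V_next
print(V)  # numerical fixed point: V = T V
```

Because T is a γ-contraction, the iterates converge geometrically from any starting vector, which is why the loop terminates well before the iteration cap.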
13 Online/Approximate Dynamic Programming
DP methods are known to suffer from the curse of dimensionality.
- Approximate DP: Bertsekas & Tsitsiklis [NDP, 1996], Powell [ADP, 2011]
- Reinforcement Learning: Q-Learning, Temporal Differences, etc. (Szepesvári [ARL, 2010])
- Stochastic approximation-based schemes: slow rate of convergence
14 Outline
I. Dynamic Programming
II. Empirical Dynamic Programming
III. Extensions
IV. Empirical Q-Value Iteration
15 Empirical Value Iteration
Dynamic programming by simulation. EVI:
V̂_{k+1}(x) = [T̂ V̂_k](x) := sup_a { r(x, a) + Ê_n[ V̂_k(Ψ(x, a, ω)) ] }
            = sup_a { r(x, a) + (γ/n) Σ_{i=1}^n V̂_k(Ψ(x, a, ω_i^{k+1})) }
where the ω's are i.i.d. noise RVs.
(V̂_k) is a random sequence; T̂ is a random monotone operator, with E[T̂(V)] ≠ T(V).
Non-incremental updates.
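A minimal sketch of EVI: the exact expectation is replaced by an empirical mean over n simulated next states. The 2-state, 2-action MDP below is invented for illustration; with finite n the iterates hover near V* rather than converging exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, n = 0.9, 2000
# Toy MDP, invented for illustration.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])

def empirical_bellman(V):
    # [T_hat V](x) = max_a { r(x, a) + (gamma / n) * sum_i V(psi(x, a, omega_i)) }
    TV = np.empty(2)
    for x in range(2):
        vals = []
        for a in range(2):
            ys = rng.choice(2, size=n, p=P[a, x])  # n i.i.d. simulated next states
            vals.append(r[a, x] + gamma * V[ys].mean())
        TV[x] = max(vals)
    return TV

V_hat = np.zeros(2)
for _ in range(200):
    V_hat = empirical_bellman(V_hat)  # non-incremental: a fresh batch of n samples per update
```

Note the random operator is biased, E[T̂(V)] ≠ T(V), because the max of sample means is not the max of means; this is exactly why the convergence analysis needs a probabilistic notion of fixed point.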
16 Questions
1. What is the behavior of the random sequence (V̂_k)?
2. Is there a relevant notion of (probabilistic) fixed point for a random (empirical Bellman) operator?
3. How does it relate to the fixed point of the classical Bellman operator?
4. Can we give a sample complexity in n? And how many iterations over k do we need for a reasonable approximation?
17 Do EVI and EPI Converge? Numerical Evidence
100 states, 5 actions, random MDP
[Plots: relative error ‖v_k − v*‖/‖v*‖ vs. number of iterations, for EVI with n = 1 and n = 5 against exact value iteration, and for EPI with n = m = 1 and n = m = 5 against exact policy iteration.]
20 How Do They Compare?
[Plot: relative error ‖v_k − v*‖/‖v*‖ vs. number of iterations for exact policy iteration, EPI (n = m = 5), EVI (n = 5), exact value iteration, QL (α = 0.5), and OPI.]
States = 100, Actions = 5, random MDP; offline QL with n = 5 samples/iteration.
21 Actual Runtime: Simulation-Time Comparison
EPI: very slow. LP method: even worse.
[Plots: simulation time in seconds and relative error (%) vs. number of iterations for EVI (n = 1, 5, 10, 20), QL (n = 20), and exact VI, with |S| = 5000, |A| = 10.]
States = 5000, Actions = 10, random MDP. All simulations run on a MacBook Pro under identical conditions.
22 The Empirical Bellman Operator and its Iterations
Q. Can we prove convergence? The iteration V̂_k = T̂_k T̂_{k−1} ⋯ T̂_1 V̂_0 is like multiplying random matrices together.
Q. Will this product converge?
23 Probabilistic Fixed Points of Random Operators
Another way to look at it: is there a (probabilistic) fixed point of the random operator T̂?
There are multiple notions of probabilistic fixed points of a sequence of random operators:
- Strong probabilistic fixed point
- Weak probabilistic fixed point
- Classical fixed point
These (asymptotic) notions coincide!
35 Sample Complexity of EVI
Theorem: Given ε ∈ (0, 1) and δ ∈ (0, 1), select
n ≥ C (1/ε²) log(2|X||A|/δ),  k ≥ log(1/δ)/µ_{n,min}.
Then P(‖V̂_k − V*‖ ≤ ε) ≥ 1 − δ.
Sample complexity of EVI: O(1/ε², log(1/δ), log |X||A|). No assumptions on MDPs needed!
W. Haskell, R. Jain and D. Kalathil, Empirical Dynamic Programming, Mathematics of Operations Research, to appear.
36 Construct a Dominating Markov Chain
[Diagram: discrete error process X_k^n, and a dominating Markov chain Y_k^n on {η*, ..., N*} with transition probabilities p_n and 1 − p_n.]
Key idea: construct a Markov chain {Y_k^n} that stochastically dominates the error process {X_k^n} in probability:
P(X_k^n ≥ z) ≤ P(Y_k^n ≥ z) for all z.
{Y_k^n} is easier to analyze than {X_k^n}; show that it converges to zero in probability as n, k → ∞.
37 Outline
I. Dynamic Programming
II. Empirical Dynamic Programming
III. Extensions
IV. Empirical Q-Value Iteration
38 Asynchronous EVI
Online EVI: update only one state at a time:
V̂_{k+1}(x) = [T̂ V̂_k](x) := sup_a { r(x, a) + Ê_n[ V̂_k(Ψ(x, a, ω)) ] }
            = sup_a { r(x, a) + (γ/n) Σ_{i=1}^n V̂_k(Ψ(x, a, ω_i^{k+1})) }
Theorem. If each state is visited infinitely often, V̂_k → V* in probability.
The proof relies on defining a new random operator that is the product of the empirical Bellman operators between hitting times.
W. Haskell, R. Jain and D. Kalathil, Empirical Dynamic Programming, Mathematics of Operations Research, to appear.
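A sketch of the asynchronous variant on an invented 2-state, 2-action toy MDP (kernel, rewards, and γ are illustrative): each iteration refreshes a single state, scheduled round-robin so that every state is visited infinitely often, as the theorem requires:

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, n = 0.9, 2000
# Toy MDP, invented for illustration.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])

V_hat = np.zeros(2)
for k in range(400):
    x = k % 2  # round-robin schedule: each state is visited infinitely often
    vals = []
    for a in range(2):
        ys = rng.choice(2, size=n, p=P[a, x])  # n i.i.d. simulated next states
        vals.append(r[a, x] + gamma * V_hat[ys].mean())
    V_hat[x] = max(vals)  # only state x is updated in this iteration
```

In an online setting the visited state would come from an exploration trajectory rather than a fixed schedule; round-robin is simply the easiest way to guarantee the infinitely-often-visited condition in a sketch.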
39 Numerical Performance of Online EVI
[Plots: relative error vs. number of iterations for asynchronous EVI (n = 1, 5, 10) and QL (n = 10, α = 0.6), updating 1 state per iteration (left) and 10 states per iteration (right).]
States = 500, Actions = 10, random MDP
40 Outline
I. Dynamic Programming
II. Empirical Dynamic Programming
III. Extensions
IV. Empirical Q-Value Iteration
41 Q-Value Iteration
Q-value of a policy π:
Q^π(x, a) = E[ Σ_{t=0}^∞ γ^t r(x_t, a_t = π(x_t)) | x_0 = x, a_0 = a ]
Optimal Q-values: Q*(x, a) = sup_π Q^π(x, a), with V*(x) = max_{a∈A} Q*(x, a) and π*(x) = arg max_{a∈A} Q*(x, a).
Q-value operator G:
G(Q)(x, a) := r(x, a) + γ Σ_y P(y | x, a) max_b Q(y, b)
The optimal Q* is the fixed point of G, a contraction.
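Exact Q-value iteration applies the operator G repeatedly. A minimal sketch on an invented 2-state, 2-action toy MDP (illustrative values, not from the talk):

```python
import numpy as np

gamma = 0.9
# Toy MDP, invented for illustration: P[a, x, y] = P(y | x, a), r[a, x] = r(x, a).
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])

def G(Q):
    # G(Q)(x, a) = r(x, a) + gamma * sum_y P(y | x, a) * max_b Q(y, b)
    V = Q.max(axis=0)          # V(y) = max_b Q(y, b); Q is indexed [a, x]
    return r + gamma * (P @ V)

Q_star = np.zeros((2, 2))
for _ in range(1000):
    Q_star = G(Q_star)
print(Q_star.max(axis=0))  # V*(x) = max_a Q*(x, a)
```

Since G is a γ-contraction in the sup norm, the iterates converge to Q* from any starting point, and maximizing Q* over actions recovers both V* and the greedy optimal policy.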
42 Empirical Q-Value Iteration (EQVI)
EQVI: simulation-based Q-value iteration:
Q̂_{k+1}(x, a) = [Ĝ Q̂_k](x, a) := r(x, a) + (γ/n) Σ_{i=1}^n max_b Q̂_k(Ψ(x, a, ω_i^{k+1}), b)
where the ω's are i.i.d. noise RVs.
Ĝ is a random (monotone) operator; Q̂_0, Q̂_1, Q̂_2, ... is a random sequence.
Non-incremental updates, vs. QL's incremental stochastic-approximation updates.
D. Kalathil, V. Borkar, and R. Jain, Empirical Q-Value Iteration, submitted, Nov. 2015.
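A minimal sketch of EQVI on an invented 2-state, 2-action toy MDP (illustrative values): the expectation inside G is replaced by an empirical mean over n simulated next states, with a full batch update of every (x, a) pair per iteration:

```python
import numpy as np

rng = np.random.default_rng(2)
gamma, n = 0.9, 2000
# Toy MDP, invented for illustration.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])

def empirical_G(Q):
    # G_hat(Q)(x, a) = r(x, a) + (gamma / n) * sum_i max_b Q(psi(x, a, omega_i), b)
    V = Q.max(axis=0)  # V(y) = max_b Q(y, b); Q is indexed [a, x]
    out = np.empty((2, 2))
    for a in range(2):
        for x in range(2):
            ys = rng.choice(2, size=n, p=P[a, x])  # n i.i.d. simulated next states
            out[a, x] = r[a, x] + gamma * V[ys].mean()
    return out

Q_hat = np.zeros((2, 2))
for _ in range(200):
    Q_hat = empirical_G(Q_hat)  # non-incremental batch update, unlike QL's step-size averaging
```

Contrast with Q-learning, which nudges a single entry by a decaying step size per sample; EQVI redraws a fresh batch and overwrites every entry, which is what the slide means by non-incremental updates.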
43 Numerical Comparison: EQVI vs QL
[Plot: relative error ‖Q_t − Q*‖/‖Q*‖ vs. number of iterations for EQVI (n = 5, 10, 20), exact QI, and QL (n = 20), synchronous updates; |S| = 500, |A| = 10.]
Speedup = 10x+
44 Online EQVI vs QL
[Plot: relative error ‖Q_t − Q*‖/‖Q*‖ vs. number of iterations for asynchronous EQVI (n = 20), QL (n = 20), and exact QI; |S| = 500, |A| = 10.]
An online version of EQVI. Speedup = 100x+?
Converges in probability under suitable recurrence conditions.
45 Other Extensions
- Continuous state space MDPs
  - State aggregation: construct an ε-net and perform EVI on the ε-net [Haskell, J. & Sharma (2015)]
  - Function approximation [Szepesvári & Munos '08]
  - Kernel-based function approximation
  - Deep neural network-based function approximation: Deep EVI/EQVI
- Average-cost case: more complicated, similar numerical performance gains [Gupta, J. & Glynn (2015)]
46 Conclusions
- Empirical Dynamic Programming algorithms: a ``natural'' way of doing approximate dynamic programming via simulations
- Iteration of random operators: the stochastically dominating Markov chain method is a fairly general technique
- Extensions to online algorithms for model-free settings
- Extension to continuous state space MDPs
- Doesn't solve all ``curses of dimensionality''
- Surprisingly good numerical performance, with a weaker notion of convergence
W. Haskell, R. Jain and D. Kalathil, Empirical Dynamic Programming, Mathematics of Operations Research, to appear.
D. Kalathil, V. Borkar, and R. Jain, Empirical Q-Value Iteration, submitted, Nov. 2015.
48 The cleverest thing to do is the simplest one.
Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Mostafa D. Awheda Department of Systems and Computer Engineering Carleton University Ottawa, Canada KS 5B6 Email: mawheda@sce.carleton.ca
More informationReinforcement Learning: An Introduction
Introduction Betreuer: Freek Stulp Hauptseminar Intelligente Autonome Systeme (WiSe 04/05) Forschungs- und Lehreinheit Informatik IX Technische Universität München November 24, 2004 Introduction What is
More informationTemporal difference learning
Temporal difference learning AI & Agents for IET Lecturer: S Luz http://www.scss.tcd.ie/~luzs/t/cs7032/ February 4, 2014 Recall background & assumptions Environment is a finite MDP (i.e. A and S are finite).
More informationMaximum Margin Planning
Maximum Margin Planning Nathan Ratliff, Drew Bagnell and Martin Zinkevich Presenters: Ashesh Jain, Michael Hu CS6784 Class Presentation Theme 1. Supervised learning 2. Unsupervised learning 3. Reinforcement
More informationQ-Learning and Stochastic Approximation
MS&E338 Reinforcement Learning Lecture 4-04.11.018 Q-Learning and Stochastic Approximation Lecturer: Ben Van Roy Scribe: Christopher Lazarus Javier Sagastuy In this lecture we study the convergence of
More informationReinforcement Learning
Reinforcement Learning Function approximation Mario Martin CS-UPC May 18, 2018 Mario Martin (CS-UPC) Reinforcement Learning May 18, 2018 / 65 Recap Algorithms: MonteCarlo methods for Policy Evaluation
More informationDeep Reinforcement Learning SISL. Jeremy Morton (jmorton2) November 7, Stanford Intelligent Systems Laboratory
Deep Reinforcement Learning Jeremy Morton (jmorton2) November 7, 2016 SISL Stanford Intelligent Systems Laboratory Overview 2 1 Motivation 2 Neural Networks 3 Deep Reinforcement Learning 4 Deep Learning
More informationProbabilistic Planning. George Konidaris
Probabilistic Planning George Konidaris gdk@cs.brown.edu Fall 2017 The Planning Problem Finding a sequence of actions to achieve some goal. Plans It s great when a plan just works but the world doesn t
More informationMarkov Decision Processes and their Applications to Supply Chain Management
Markov Decision Processes and their Applications to Supply Chain Management Jefferson Huang School of Operations Research & Information Engineering Cornell University June 24 & 25, 2018 10 th Operations
More informationToday s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes
Today s s Lecture Lecture 20: Learning -4 Review of Neural Networks Markov-Decision Processes Victor Lesser CMPSCI 683 Fall 2004 Reinforcement learning 2 Back-propagation Applicability of Neural Networks
More informationReinforcement Learning as Variational Inference: Two Recent Approaches
Reinforcement Learning as Variational Inference: Two Recent Approaches Rohith Kuditipudi Duke University 11 August 2017 Outline 1 Background 2 Stein Variational Policy Gradient 3 Soft Q-Learning 4 Closing
More informationAbstract Dynamic Programming
Abstract Dynamic Programming Dimitri P. Bertsekas Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology Overview of the Research Monograph Abstract Dynamic Programming"
More informationarxiv: v2 [cs.lg] 6 Sep 2011
Journal of Machine Learning Research??? (???)??? Submitted???; Published??? Dynamic Policy Programming arxiv:1004.2027v2 [cs.lg] 6 Sep 2011 Mohammad Gheshlaghi Azar Vicenç Gómez Hilbert J. Kappen Department
More informationReinforcement Learning and Deep Reinforcement Learning
Reinforcement Learning and Deep Reinforcement Learning Ashis Kumer Biswas, Ph.D. ashis.biswas@ucdenver.edu Deep Learning November 5, 2018 1 / 64 Outlines 1 Principles of Reinforcement Learning 2 The Q
More informationMDP Preliminaries. Nan Jiang. February 10, 2019
MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process
More informationOptimal sequential decision making for complex problems agents. Damien Ernst University of Liège
Optimal sequential decision making for complex problems agents Damien Ernst University of Liège Email: dernst@uliege.be 1 About the class Regular lectures notes about various topics on the subject with
More informationThe convergence limit of the temporal difference learning
The convergence limit of the temporal difference learning Ryosuke Nomura the University of Tokyo September 3, 2013 1 Outline Reinforcement Learning Convergence limit Construction of the feature vector
More informationMarkov decision processes
CS 2740 Knowledge representation Lecture 24 Markov decision processes Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Administrative announcements Final exam: Monday, December 8, 2008 In-class Only
More informationReinforcement Learning and Optimal Control. ASU, CSE 691, Winter 2019
Reinforcement Learning and Optimal Control ASU, CSE 691, Winter 2019 Dimitri P. Bertsekas dimitrib@mit.edu Lecture 8 Bertsekas Reinforcement Learning 1 / 21 Outline 1 Review of Infinite Horizon Problems
More information