The Art of Sequential Optimization via Simulations


1 The Art of Sequential Optimization via Simulations
Stochastic Systems and Learning Laboratory
EE, CS* & ISE* Departments (*by courtesy)
Viterbi School of Engineering, University of Southern California
(Based on joint work with Dileep Kalathil (UC Berkeley), W. Haskell (NU Singapore), V. Borkar (IITB), A. Gupta (Ohio State) & P. Glynn (Stanford))
Nov 19, 2015

2 Outline
I. Dynamic Programming
II. Empirical Dynamic Programming
III. Extensions
IV. Empirical Q-Value Iteration

3 Planning in Stochastic Environments
[Figure: example with States = 1200, Actions = 4]
Large state spaces, small action spaces.
What is the strategy to achieve maximum expected reward?
Model ingredients: States, Actions, Transitions, Rewards.
What is the solution? A policy: a map from a state to an action.

4 Markov Decision Processes: A Formal Model
[Figure: trajectory x_0 → x_1 → x_2 under actions a_0, a_1, collecting rewards r(x_0, a_0), r(x_1, a_1)]
An MDP with:
State space X, initial distribution λ
Action space A
State transition probability Q(y | x, a)
Reward function r(x, a)
Fixed, stationary policies π(a; x)
Value of a policy V^π; Objective: sup_π V^π
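To fix ideas, here is a minimal sketch (not from the talk) of a finite MDP represented as NumPy arrays; the helper name random_mdp and the array layout are illustrative assumptions, while Q, r, and the discount factor follow the slide's notation.

```python
import numpy as np

def random_mdp(num_states, num_actions, gamma=0.9, seed=0):
    """Build a random finite MDP (X, A, Q, r, gamma) as NumPy arrays."""
    rng = np.random.default_rng(seed)
    # Q[a, x, y] = transition probability Q(y | x, a); each row sums to 1.
    Q = rng.random((num_actions, num_states, num_states))
    Q /= Q.sum(axis=2, keepdims=True)
    # r[x, a] = one-step reward for taking action a in state x.
    r = rng.random((num_states, num_actions))
    return Q, r, gamma
```

A stationary deterministic policy is then just an integer array pi of length |X| mapping each state to an action.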

5 Dynamic Programming
State value function of a policy π:
V^π(x) = E^π[ Σ_{t=0}^∞ γ^t r(x_t, a_t) ]
Optimal value function: V*(x) = sup_π V^π(x)
The dynamic programming equation:
V*(x) = sup_a { r(x, a) + γ E[V*(y) | x, a] }
"Principle of Optimality" (Bellman, 1957)

11 DP and the Bellman Operator
[Figure: the simulation model Ψ maps (x_t, a_t, ω_t) to x_{t+1}]
Let Ψ : X × A × ℝ → X be a simulation model for the transition kernel Q, i.e., x_{t+1} = Ψ(x_t, a_t, ω_t) with noise ω_t.
The Bellman operator is
[TV](x) := sup_a { r(x, a) + γ E_ω[V(Ψ(x, a, ω))] }
The DP equation is now a fixed point equation: V* = TV*.
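One standard way to realize such a simulation model from a known kernel is inverse-transform sampling with uniform noise; a minimal sketch, assuming the random_mdp arrays above (the construction is a textbook device, not necessarily the talk's):

```python
def make_simulation_model(Q):
    """Return psi(x, a, w), a simulation model for the kernel Q.

    With w uniform on [0, 1], the next state is drawn by inverting the
    CDF of Q(. | x, a), so psi(x, a, W) ~ Q(. | x, a) when W ~ U[0, 1].
    """
    cdf = np.cumsum(Q, axis=2)  # cdf[a, x, y] = P(next state <= y | x, a)

    def psi(x, a, w):
        # first state index whose cumulative probability reaches w;
        # the min() guards against float round-off at the top of the CDF
        y = int(np.searchsorted(cdf[a, x], w))
        return min(y, cdf.shape[2] - 1)

    return psi
```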

12 The Value Iteration Algorithm
The Bellman operator T is a contraction:
‖TV_1 − TV_2‖ ≤ γ ‖V_1 − V_2‖
Value Iteration:
V_{k+1}(x) = [TV_k](x) := sup_a { r(x, a) + γ E_ω[V_k(Ψ(x, a, ω))] }
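For concreteness, a minimal sketch of exact value iteration on the arrays above (an illustrative implementation, not the talk's code):

```python
def value_iteration(Q, r, gamma, tol=1e-8, max_iters=10_000):
    """Exact VI: iterate V <- TV until the sup-norm change is below tol."""
    num_actions, num_states, _ = Q.shape
    V = np.zeros(num_states)
    for _ in range(max_iters):
        # [TV](x) = max_a { r(x, a) + gamma * sum_y Q(y | x, a) V(y) }
        TV = (r.T + gamma * (Q @ V)).max(axis=0)
        if np.abs(TV - V).max() < tol:
            return TV
        V = TV
    return V
```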

13 Online/Approximate Dynamic Programming
DP methods are known to suffer from the curse of dimensionality.
Approximate DP: Bertsekas-Tsitsiklis [NDP, 1996], Powell [ADP, 2011]
Reinforcement Learning: Q-Learning, Temporal Differences, etc. (Szepesvari [ARL, 2010])
Stochastic approximation-based schemes: slow rate of convergence.

14 Outline
I. Dynamic Programming
II. Empirical Dynamic Programming
III. Extensions
IV. Empirical Q-Value Iteration

15 Empirical Value Iteration
Dynamic programming by simulation.
EVI: V̂_{k+1}(x) = [T̂ V̂_k](x) := sup_a { r(x, a) + γ Ê_n[V̂_k(Ψ(x, a, ω))] } = sup_a { r(x, a) + (γ/n) Σ_{i=1}^n V̂_k(Ψ(x, a, ω_i^{k+1})) }
where the ω's are i.i.d. noise RVs.
{V̂_k} is a random sequence; T̂ is a random monotone operator, and E[T̂(V)] ≠ T(V).
Non-incremental updates.
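A minimal sketch of one empirical Bellman update, using a simulation model psi as above; the batch size n and the RNG plumbing are illustrative assumptions:

```python
def empirical_bellman(V, r, gamma, psi, n, rng):
    """One EVI step: replace E[V(psi(x, a, w))] by an n-sample average."""
    num_states, num_actions = r.shape
    V_next = np.empty(num_states)
    for x in range(num_states):
        best = -np.inf
        for a in range(num_actions):
            w = rng.random(n)  # fresh i.i.d. noise for every (x, a) update
            avg = np.mean([V[psi(x, a, wi)] for wi in w])
            best = max(best, r[x, a] + gamma * avg)
        V_next[x] = best
    return V_next
```

Iterating V <- empirical_bellman(V, ...) produces exactly the random sequence V̂_k whose behavior the next slide asks about.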

16 Questions
1. What is the behavior of the random sequence {V̂_k}?
2. Is there a relevant notion of (probabilistic) fixed point for a random (empirical Bellman) operator?
3. How does it relate to the fixed point of the classical Bellman operator?
4. Can we give a sample complexity bound on n? And how many iterations k do we need for a reasonable approximation?

17 Do EVI and EPI Converge? Numerical Evidence
100 states, 5 actions, random MDP.
[Figure, left: relative error ‖v_k − v*‖/‖v*‖ vs. number of iterations for EVI (n = 1, 5) and Exact Value Iteration]
[Figure, right: relative error ‖v_k − v*‖/‖v*‖ vs. number of iterations for EPI (n = m = 1, 5) and Exact Policy Iteration]

20 How do they compare?
[Figure: relative error ‖v_k − v*‖/‖v*‖ vs. number of iterations for Exact Policy Iteration, EPI (n = m = 5), EVI (n = 5), Exact Value Iteration, QL (step-size parameter 0.5), and OPI]
States = 100, Actions = 5, random MDP. Offline QL with n = 5 samples/iteration.
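For reference, the QL baseline in these plots is ordinary incremental Q-learning; a minimal sketch of its update (reading the "0.5" in the legend as the step-size exponent, alpha_k = k^{-0.5}, is my assumption):

```python
def q_learning_step(q, x, a, reward, x_next, gamma, alpha):
    """Incremental QL: move q(x, a) a step toward a one-sample target."""
    target = reward + gamma * q[x_next].max()
    q[x, a] += alpha * (target - q[x, a])
```

Unlike EVI, which averages a fresh batch of n samples per update, each QL update mixes a single sample in with a decaying step size, which is why it inherits the slow stochastic-approximation rate noted earlier.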

21 Actual Runtime
Simulation time comparison. EPI: very slow. LP method: even worse.
[Figure, left: simulation time (seconds) vs. relative error (%) for EVI (n = 5, 6, 10, 18, ...), Exact VI, and QL; |S| = 5000, |A| = 10]
[Figure, right: relative error vs. number of iterations for EVI (n = 1, 5, 10, 20), QL (n = 20), and VI; |S| = 5000, |A| = 10]
States = 5000, Actions = 10, random MDP. All simulations run on a MacBook Pro under identical conditions.

22 The Empirical Bellman Operator and its Iterations
Q. Can we prove convergence?
The EVI iterates compose i.i.d. random operators, V̂_k = T̂_k T̂_{k−1} ··· T̂_1 V̂_0; this is like multiplying random matrices together.
Q. Will this product converge?

23 Probabilistic Fixed Points of Random Operators
Another way to look at it: is there a (probabilistic) fixed point of T̂?
Multiple notions of probabilistic fixed points of a sequence of random operators:
Strong probabilistic fixed point
Weak probabilistic fixed point
Classical fixed point
These (asymptotic) notions coincide!
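For orientation, here is one way to formalize the two probabilistic notions, paraphrased from the Empirical Dynamic Programming paper cited below; the slides do not display the definitions, so treat this as a hedged reconstruction rather than the exact statements:

```latex
% Hedged reconstruction: \hat v is a strong probabilistic fixed point of
% the operator sequence {\widehat{T}_n} if applying \widehat{T}_n barely moves it:
\lim_{n \to \infty} \Pr\big( \| \widehat{T}_n \hat v - \hat v \| > \epsilon \big) = 0
  \quad \text{for all } \epsilon > 0.

% ... and a weak probabilistic fixed point if the k-fold iterates
% concentrate around \hat v as k, then n, grow:
\lim_{n \to \infty} \lim_{k \to \infty}
  \Pr\big( \| \widehat{T}_n^{\,k} v_0 - \hat v \| > \epsilon \big) = 0
  \quad \text{for all } v_0 \text{ and all } \epsilon > 0.
```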

35 Sample Complexity of EVI
Theorem: Given ε ∈ (0, 1) and δ ∈ (0, 1), select
n ≥ (C_1/ε²) log(2|X||A|/δ) and k ≥ log(1/δ)/μ_{n,min}.
Then P(‖V̂_k − V*‖ ≤ ε) ≥ 1 − δ.
Sample complexity of EVI: O(1/ε², log(1/δ), log |X||A|).
No assumptions on the MDP needed!
W. Haskell, R. Jain and D. Kalathil, "Empirical Dynamic Programming," Mathematics of Operations Research, to appear, 2015.

36 And Construct a Dominating Markov Chain
[Figure: the discrete error process X_k^n alongside a dominating Markov chain Y_k^n on the grid {0, η*, ..., N*}, with transition probabilities p_n and 1 − p_n]
The key idea is the construction of a Markov chain {Y_k^n} that stochastically dominates the error process {X_k^n}:
P(X_k^n ≥ z) ≤ P(Y_k^n ≥ z) for all z.
{Y_k^n} is easier to analyze than {X_k^n}; show that it converges to zero in probability as n, k → ∞.

37 Outline
I. Dynamic Programming
II. Empirical Dynamic Programming
III. Extensions
IV. Empirical Q-Value Iteration

38 Asynchronous EVI
Online EVI: update only one state at a time:
V̂_{k+1}(x) = [T̂ V̂_k](x) := sup_a { r(x, a) + γ Ê_n[V̂_k(Ψ(x, a, ω))] } = sup_a { r(x, a) + (γ/n) Σ_{i=1}^n V̂_k(Ψ(x, a, ω_i^{k+1})) }
Theorem. If each state is visited infinitely often, then V̂_k → V* in probability.
The proof relies on defining a new random operator that is the composition of the empirical Bellman operators between hitting times.
W. Haskell, R. Jain and D. Kalathil, "Empirical Dynamic Programming," Mathematics of Operations Research, to appear.
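A minimal sketch of the asynchronous variant, updating a single (here, randomly chosen) state per iteration; the uniform state-selection rule is an illustrative assumption that makes every state visited infinitely often:

```python
def async_evi(V, r, gamma, psi, n, num_iters, rng):
    """Asynchronous EVI: one empirical Bellman update per iteration."""
    num_states, num_actions = r.shape
    for _ in range(num_iters):
        x = rng.integers(num_states)  # visit states uniformly at random
        values = []
        for a in range(num_actions):
            w = rng.random(n)
            values.append(r[x, a] + gamma * np.mean([V[psi(x, a, wi)] for wi in w]))
        V[x] = max(values)  # update only the visited state
    return V
```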

39 Numerical Performance of Online EVI
[Figure, left: asynchronous EVI and QL, updating only 1 state per iteration; relative error vs. number of iterations for EVI (n = 10) and QL (n = 10, step-size parameter 0.6)]
[Figure, right: asynchronous EVI and QL, updating 10 states per iteration; relative error vs. number of iterations for EVI (n = 1, 5, 10) and QL (n = 10, step-size parameter 0.6)]
States = 500, Actions = 10, random MDP.

40 Outline
I. Dynamic Programming
II. Empirical Dynamic Programming
III. Extensions
IV. Empirical Q-Value Iteration

41 Q-Value Iteration
Q-value of a policy π:
Q^π(x, a) = E[ Σ_{t=0}^∞ γ^t r(x_t, a_t = π(x_t)) | x_0 = x, a_0 = a ]
Optimal Q-value: Q*(x, a) = sup_π Q^π(x, a), and V*(x) = max_{a∈A} Q*(x, a).
Q-value operator G:
G(Q)(x, a) := r(x, a) + γ Σ_y P(y | x, a) max_b Q(y, b)
The optimal Q* is the fixed point of G, a contraction, and the optimal policy is π*(x) = arg max_{a∈A} Q*(x, a).
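For concreteness, a minimal sketch of exact Q-value iteration on the same arrays (illustrative, not the talk's code):

```python
def q_value_iteration(Q, r, gamma, tol=1e-8, max_iters=10_000):
    """Exact QI: iterate q <- G(q) until the sup-norm change is below tol."""
    num_actions, num_states, _ = Q.shape
    q = np.zeros((num_states, num_actions))
    for _ in range(max_iters):
        # G(q)(x, a) = r(x, a) + gamma * sum_y Q(y | x, a) max_b q(y, b)
        Gq = r + gamma * (Q @ q.max(axis=1)).T
        if np.abs(Gq - q).max() < tol:
            return Gq
        q = Gq
    return q
```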

42 Empirical Q-Value Iteration (EQVI)
EQVI: simulation-based Q-value iteration,
Q̂_{k+1}(x, a) = [Ĝ Q̂_k](x, a) := r(x, a) + (γ/n) Σ_{i=1}^n max_b Q̂_k(Ψ(x, a, ω_i^{k+1}), b)
where the ω's are i.i.d. noise RVs.
Ĝ is a random (monotone) operator; Q̂_0, Q̂_1, Q̂_2, ... is a random sequence.
Non-incremental updates vs. QL.
D. Kalathil, V. Borkar and R. Jain, "Empirical Q-Value Iteration," submitted.
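A minimal sketch of one EQVI update with the simulation model psi (illustrative; note that only psi, not the kernel Q, is touched when estimating the expectation):

```python
def eqvi_step(q, r, gamma, psi, n, rng):
    """One EQVI step: empirical version of the Q-value operator G."""
    num_states, num_actions = r.shape
    q_next = np.empty_like(q)
    for x in range(num_states):
        for a in range(num_actions):
            w = rng.random(n)
            # n-sample estimate of E[max_b q(psi(x, a, w), b)]
            est = np.mean([q[psi(x, a, wi)].max() for wi in w])
            q_next[x, a] = r[x, a] + gamma * est
    return q_next
```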

43 Numerical Comparison: EQVI vs QL
[Figure: synchronous comparison; relative error ‖Q_t − Q*‖/‖Q*‖ vs. number of iterations for EQVI (n = 5, 10, 20), Exact QI, and QL (n = 20); |S| = 500, |A| = 10]
Speedup = 10x+

44 Online EQVI vs QL
[Figure: asynchronous comparison; relative error ‖Q_t − Q*‖/‖Q*‖ vs. number of iterations for EQVI (n = 20), QL (n = 20), and Exact QI; |S| = 500, |A| = 10]
An online version of EQVI. Speedup = 100x+?
Converges in probability under suitable recurrence conditions.

45 Other Extensions
Continuous state space MDPs:
State aggregation: construct an ε-net and perform EVI on the ε-net [Haskell, J. & Sharma (2015)]
Function approximation [Szepesvari & Munos 08]:
Kernel-based function approximation
Deep neural network-based function approximation: DEEP EVI/EQVI
Average cost: more complicated, similar numerical performance gains [Gupta, J. & Glynn (2015)]

46 Conclusions
Empirical Dynamic Programming algorithms: a "natural" way of doing approximate dynamic programming via simulations.
Iteration of random operators: the stochastically dominating Markov chain method is a fairly general technique.
Extensions to online algorithms for model-free settings.
Extension to continuous state space MDPs.
Doesn't solve all "curses of dimensionality."
Surprisingly good numerical performance; weaker notion of convergence.
W. Haskell, R. Jain and D. Kalathil, "Empirical Dynamic Programming," Mathematics of Operations Research, to appear.
D. Kalathil, V. Borkar and R. Jain, "Empirical Q-Value Iteration," submitted.


48 The cleverest thing to do is the simplest one.
