Reinforcement Learning and Optimal Control. ASU, CSE 691, Winter 2019

Reinforcement Learning and Optimal Control
ASU, CSE 691, Winter 2019
Dimitri P. Bertsekas
Lecture 8

Outline
1 Review of Infinite Horizon Problems
2 Exact Policy Iteration
3 Approximations with Policy Iteration

Stochastic DP Problems - Infinite Horizon

[Figure: random transitions x_{k+1} = f(x_k, u_k, w_k) and random stage costs g(x_k, u_k, w_k), continuing over an infinite horizon.]

Infinite number of stages, and stationary system and cost:
System x_{k+1} = f(x_k, u_k, w_k) with state, control, and random disturbance.
Policies π = {µ_0, µ_1, ...} with µ_k(x) ∈ U(x) for all x and k.
Special scalar α with 0 < α ≤ 1. If α < 1 the problem is called discounted.
Cost of stage k: α^k g(x_k, µ_k(x_k), w_k).
Cost of a policy: J_π(x_0) = lim_{N→∞} E_{w_k} { Σ_{k=0}^{N−1} α^k g(x_k, µ_k(x_k), w_k) }.
Optimal cost function: J*(x_0) = min_π J_π(x_0).
If α = 1 we assume a special cost-free termination state t. The objective is to reach t at minimum expected cost. The problem is called a stochastic shortest path (SSP) problem.
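
For the finite-state discussion that follows, it helps to fix a concrete encoding. Below is a minimal sketch (hypothetical example data, not from the lecture) of a finite-state, finite-control discounted problem as numpy arrays; the later sketches assume this same P/G/alpha layout, with every control available at every state for simplicity.

```python
import numpy as np

# Hypothetical finite-state discounted problem: n = 2 states, m = 2 controls.
n, m = 2, 2
alpha = 0.9                                   # discount factor, 0 < alpha < 1

# P[i, u, j] = p_ij(u): probability of moving from state i to j under control u.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])

# G[i, u, j] = g(i, u, j): cost of the transition from i to j under control u.
G = np.array([[[1.0, 2.0], [0.5, 3.0]],
              [[2.0, 0.0], [4.0, 1.0]]])

# Expected one-stage cost of the pair (i, u): sum_j p_ij(u) * g(i, u, j).
g_bar = np.einsum('iuj,iuj->iu', P, G)
print(g_bar)
```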

Main Results - Finite-State Notation - Discounted Problems

Convergence of VI: Given any initial conditions J_0(1), ..., J_0(n), the sequence {J_k(i)} generated by VI
    J_{k+1}(i) = min_{u ∈ U(i)} Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + α J_k(j) ),   i = 1, ..., n,
converges to J*(i) for each i.

Bellman's equation: The optimal cost function J* = (J*(1), ..., J*(n)) satisfies the equation
    J*(i) = min_{u ∈ U(i)} Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + α J*(j) ),   i = 1, ..., n,
and is the unique solution of this equation.

Optimality condition: A stationary policy µ is optimal if and only if for every state i, µ(i) attains the minimum in the Bellman equation.
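
A minimal sketch of the VI algorithm above, assuming the hypothetical P/G/alpha array layout introduced earlier; the greedy policy returned at the end attains the minimum in Bellman's equation for the final iterate.

```python
import numpy as np

def value_iteration(P, G, alpha, tol=1e-10, max_iter=10_000):
    """VI: J_{k+1}(i) = min_u sum_j p_ij(u) * (g(i,u,j) + alpha * J_k(j))."""
    n = P.shape[0]
    J = np.zeros(n)                                   # any initial conditions work
    for _ in range(max_iter):
        # Q[i, u] = sum_j p_ij(u) * (g(i, u, j) + alpha * J(j))
        Q = np.einsum('iuj,iuj->iu', P, G) + alpha * P @ J
        J_new = Q.min(axis=1)
        if np.max(np.abs(J_new - J)) < tol:           # sup-norm stopping test
            J = J_new
            break
        J = J_new
    mu = Q.argmin(axis=1)                             # greedy policy for the final J
    return J, mu
```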

Additional Results: Bellman Equation and Value Iteration for Policies

Fix a policy µ with cost function J_µ. Change the problem so the only control available at i is just µ(i) [not the set U(i)]. Apply our Bellman equation and VI convergence results:

The VI algorithm (for policy µ),
    J_{k+1}(i) = Σ_{j=1}^n p_{ij}(µ(i)) ( g(i, µ(i), j) + α J_k(j) ),   i = 1, ..., n,
converges to the cost J_µ(i) for each i, for any initial conditions J_0(1), ..., J_0(n).

J_µ is the unique solution of the Bellman equation (of policy µ)
    J_µ(i) = Σ_{j=1}^n p_{ij}(µ(i)) ( g(i, µ(i), j) + α J_µ(j) ),   i = 1, ..., n.

Solving this linear system of n equations with n unknowns, the costs J_µ(i), is called evaluation of policy µ.
Evaluation of µ can be done by exact solution of the Bellman equation (e.g., Gaussian elimination), or iteratively with the VI algorithm (most likely for large n).
Similar results hold for SSP problems.
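
A minimal sketch of both evaluation options, again assuming the hypothetical P/G/alpha layout (the policy mu is an integer array with mu[i] the control used at state i): an exact linear solve of (I − α P_µ) J_µ = ḡ_µ, and the iterative VI variant preferred when n is large.

```python
import numpy as np

def evaluate_policy(P, G, alpha, mu):
    """Exact evaluation: solve (I - alpha * P_mu) J_mu = g_mu for J_mu."""
    n = P.shape[0]
    idx = np.arange(n)
    P_mu = P[idx, mu, :]                                    # P_mu[i, j] = p_ij(mu(i))
    g_mu = np.einsum('ij,ij->i', P_mu, G[idx, mu, :])       # expected stage cost under mu
    return np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)

def evaluate_policy_vi(P, G, alpha, mu, num_iters=1000):
    """Iterative evaluation: repeat J <- T_mu J (VI for the policy mu)."""
    n = P.shape[0]
    idx = np.arange(n)
    P_mu = P[idx, mu, :]
    g_mu = np.einsum('ij,ij->i', P_mu, G[idx, mu, :])
    J = np.zeros(n)                                         # any initial conditions work
    for _ in range(num_iters):
        J = g_mu + alpha * P_mu @ J
    return J
```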

Shorthand Notation

We introduce the DP operators
    (T_µ J)(i) = Σ_{j=1}^n p_{ij}(µ(i)) ( g(i, µ(i), j) + α J(j) ),   i = 1, ..., n,
    (T J)(i) = min_{u ∈ U(i)} Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + α J(j) ),   i = 1, ..., n.

They provide convenience of notation AND a vehicle for unification.
T_µ and T form the mathematical "signature" of a DP problem, and serve to unify the DP theory (extensions to minimax, games, infinite spaces problems, etc.).
Their critical property is monotonicity (as J increases, so do T_µ J and TJ); see the "Abstract DP" book (DPB, 2018).
All the DP results/algorithms can be written in math shorthand using T and T_µ:
VI algorithm: J_{k+1} = T J_k, J_{k+1} = T_µ J_k, k = 0, 1, ...
Bellman equation: J* = T J*, J_µ = T_µ J_µ.
µ is optimal if and only if T J* = T_µ J*.
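
As a sketch of how the shorthand translates into code (hypothetical P/G/alpha layout as before), the two operators are just the mappings used inside VI and policy evaluation; VI is J ← T J, and J_µ solves J = T_µ J.

```python
import numpy as np

def T(J, P, G, alpha):
    """(T J)(i) = min_u sum_j p_ij(u) * (g(i,u,j) + alpha * J(j))."""
    Q = np.einsum('iuj,iuj->iu', P, G) + alpha * P @ J
    return Q.min(axis=1)

def T_mu(J, P, G, alpha, mu):
    """(T_mu J)(i) = sum_j p_ij(mu(i)) * (g(i,mu(i),j) + alpha * J(j))."""
    idx = np.arange(len(J))
    P_mu, G_mu = P[idx, mu, :], G[idx, mu, :]
    return np.einsum('ij,ij->i', P_mu, G_mu) + alpha * P_mu @ J

# In this shorthand: VI is J <- T(J, ...), policy evaluation solves J = T_mu(J, ...),
# and Bellman's equation is J* = T(J*, ...).
```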

Contraction Property of T and T_µ

    (T_µ J)(i) = Σ_{j=1}^n p_{ij}(µ(i)) ( g(i, µ(i), j) + α J(j) ),   i = 1, ..., n,
    (T J)(i) = min_{u ∈ U(i)} Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + α J(j) ),   i = 1, ..., n.

In our discounted and SSP problems, T and T_µ are contractions.
Introduce a (weighted max) norm for the vectors J = (J(1), ..., J(n)):
    ||J|| = max_{i=1,...,n} |J(i)| / v(i),
where v(1), ..., v(n) are some positive scalars.
Definition: A mapping H that maps J = (J(1), ..., J(n)) to the vector HJ = ((HJ)(1), ..., (HJ)(n)) is a contraction if for some ρ with 0 < ρ < 1,
    ||HJ − HJ'|| ≤ ρ ||J − J'||,   for all J, J'.
For our discounted and SSP problems, under our assumptions, T and T_µ are contractions (in addition to being monotone). For the discounted problem, ρ = α and v(i) ≡ 1.
This is the mathematical reason why our problems are so nice!
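
As a quick check of the claim for the discounted case with v(i) ≡ 1 (the unweighted max norm), here is a sketch of the standard argument that T_µ is an α-contraction; the same bound for T follows because taking a pointwise minimum over u can only shrink the difference.

```latex
(T_\mu J)(i) - (T_\mu J')(i)
  = \alpha \sum_{j=1}^n p_{ij}\bigl(\mu(i)\bigr)\,\bigl(J(j) - J'(j)\bigr)
  \;\le\; \alpha \max_{j=1,\ldots,n} \bigl|J(j) - J'(j)\bigr|
  \;=\; \alpha\,\|J - J'\|
```

Exchanging the roles of J and J' gives the same bound for the absolute difference, so ||T_µ J − T_µ J'|| ≤ α ||J − J'||.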

Policy Iteration (PI) Algorithm: Generates a Sequence of Policies {µ^k}

[Figure: starting from an initial policy, PI cycles between policy cost evaluation (evaluate the cost function J_µ of the current policy µ) and policy improvement (generate an improved policy µ̄).]

Given the current policy µ^k, a PI iteration consists of two phases:
Policy evaluation computes J_{µ^k}(i), i = 1, ..., n, as the solution of the (linear) Bellman equation system
    J_{µ^k}(i) = Σ_{j=1}^n p_{ij}(µ^k(i)) ( g(i, µ^k(i), j) + α J_{µ^k}(j) ),   i = 1, ..., n.
Policy improvement then computes a new policy µ^{k+1} as
    µ^{k+1}(i) ∈ arg min_{u ∈ U(i)} Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + α J_{µ^k}(j) ),   i = 1, ..., n.
Compactly (in shorthand): PI is written as T_{µ^{k+1}} J_{µ^k} = T J_{µ^k}.
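
A minimal sketch of exact PI under the hypothetical P/G/alpha layout from earlier: policy evaluation by an exact linear solve, then greedy policy improvement, repeated until the policy stops changing (finite-step termination).

```python
import numpy as np

def policy_iteration(P, G, alpha, mu0=None):
    """Exact PI: alternate evaluation and improvement until the policy repeats."""
    n = P.shape[0]
    idx = np.arange(n)
    mu = np.zeros(n, dtype=int) if mu0 is None else mu0.copy()
    while True:
        # Policy evaluation: solve the linear Bellman system for J_mu.
        P_mu = P[idx, mu, :]
        g_mu = np.einsum('ij,ij->i', P_mu, G[idx, mu, :])
        J_mu = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)
        # Policy improvement: mu_new(i) attains min_u sum_j p_ij(u)(g(i,u,j) + alpha J_mu(j)).
        Q = np.einsum('iuj,iuj->iu', P, G) + alpha * P @ J_mu
        mu_new = Q.argmin(axis=1)
        if np.array_equal(mu_new, mu):     # no change: mu is optimal
            return J_mu, mu
        mu = mu_new
```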

Proof of Policy Improvement Property

PI finite-step convergence: PI generates an improving sequence of policies, i.e., J_{µ^{k+1}}(i) ≤ J_{µ^k}(i) for all i and k, and terminates with an optimal policy.

We will show that J_{µ̄} ≤ J_µ, where µ̄ is obtained from µ by PI.
Denote by J_N the cost function of a policy that applies µ̄ for the first N stages and applies µ thereafter.
We have the Bellman equation
    J_µ(i) = Σ_{j=1}^n p_{ij}(µ(i)) ( g(i, µ(i), j) + α J_µ(j) ),
so
    J_1(i) = Σ_{j=1}^n p_{ij}(µ̄(i)) ( g(i, µ̄(i), j) + α J_µ(j) ) ≤ J_µ(i)   (by the policy improvement equation).
From the definition of J_2 and J_1, monotonicity, and the preceding relation, we have
    J_2(i) = Σ_{j=1}^n p_{ij}(µ̄(i)) ( g(i, µ̄(i), j) + α J_1(j) ) ≤ Σ_{j=1}^n p_{ij}(µ̄(i)) ( g(i, µ̄(i), j) + α J_µ(j) ) = J_1(i),
so J_2(i) ≤ J_1(i) ≤ J_µ(i) for all i. Continuing similarly, we obtain J_{N+1}(i) ≤ J_N(i) ≤ J_µ(i) for all i and N.
Since J_N → J_{µ̄} (VI for µ̄ converges), it follows that J_{µ̄} ≤ J_µ.

Optimistic PI: Like Standard PI, but Policy Evaluation is Approximate, Based on a Finite Number of VI Iterations

Generates sequences of cost function approximations {J_k} and policies {µ^k}.
Given the typical function J_k:
Policy improvement computes a policy µ^k such that
    µ^k(i) ∈ arg min_{u ∈ U(i)} Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + α J_k(j) ),   i = 1, ..., n.
Optimistic policy evaluation starts with Ĵ_{k,0} = J_k, and uses m_k VI iterations for policy µ^k to compute Ĵ_{k,1}, ..., Ĵ_{k,m_k} according to
    Ĵ_{k,m+1}(i) = Σ_{j=1}^n p_{ij}(µ^k(i)) ( g(i, µ^k(i), j) + α Ĵ_{k,m}(j) ),   i = 1, ..., n,  m = 0, ..., m_k − 1,
and sets J_{k+1} = Ĵ_{k,m_k}.

Convergence (using a cost improvement argument similar to standard PI): For the optimistic PI algorithm, we have J_k → J* and J_{µ^k} → J*.
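
A minimal sketch of optimistic PI under the hypothetical P/G/alpha layout: each outer iteration does one greedy improvement based on the current approximation J_k, then m_k value-iteration sweeps for the new policy in place of an exact evaluation (the number of outer iterations and m_k are arbitrary illustration values).

```python
import numpy as np

def optimistic_pi(P, G, alpha, num_outer=100, m_k=5):
    """Optimistic PI: policies are evaluated approximately with m_k VI sweeps."""
    n = P.shape[0]
    idx = np.arange(n)
    J = np.zeros(n)                                   # J_0
    for _ in range(num_outer):
        # Policy improvement based on the current approximation J_k.
        Q = np.einsum('iuj,iuj->iu', P, G) + alpha * P @ J
        mu = Q.argmin(axis=1)
        # Approximate evaluation: m_k value iterations for policy mu.
        P_mu = P[idx, mu, :]
        g_mu = np.einsum('ij,ij->i', P_mu, G[idx, mu, :])
        for _ in range(m_k):
            J = g_mu + alpha * P_mu @ J               # J-hat recursion; J_{k+1} = J-hat_{k, m_k}
    return J, mu
```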

Multistep Policy Iteration: Policy Improvement with Multistep Lookahead

Motivation: It may yield a better policy µ^{k+1} than with one-step lookahead, at the expense of a more complex policy improvement operation.

Given the typical policy µ^k:
Policy evaluation computes J_{µ^k}(i), i = 1, ..., n, as the solution of the (linear) system of Bellman equations
    J_{µ^k}(i) = Σ_{j=1}^n p_{ij}(µ^k(i)) ( g(i, µ^k(i), j) + α J_{µ^k}(j) ),   i = 1, ..., n.
Policy improvement with ℓ-step lookahead then solves the ℓ-stage problem with terminal cost function J_{µ^k}. If {µ̂_0, ..., µ̂_{ℓ−1}} is the optimal policy of this problem, then the new policy µ^{k+1} is µ̂_0.

Convergence (using similar argument to standard PI): Exact multistep PI has the same solid convergence properties as its one-step lookahead counterpart.
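
A minimal sketch of the ℓ-step lookahead improvement step, assuming the hypothetical P/G/alpha layout from earlier: the ℓ-stage problem with terminal cost J_{µ^k} is solved by backward DP, i.e., ℓ − 1 applications of the operator T followed by one more minimization whose argmin is the first-stage policy µ̂_0.

```python
import numpy as np

def multistep_improvement(P, G, alpha, J_mu, ell):
    """l-step lookahead improvement: backward DP over ell stages with terminal
    cost J_mu, returning the first-stage policy (this becomes mu^{k+1})."""
    g_bar = np.einsum('iuj,iuj->iu', P, G)
    J = J_mu.copy()
    for _ in range(ell - 1):                    # compute T^{ell-1} J_mu
        J = (g_bar + alpha * P @ J).min(axis=1)
    Q = g_bar + alpha * P @ J                   # first (in time) lookahead stage
    return Q.argmin(axis=1)                     # mu-hat_0
```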

Policy Iteration for Q-Factors (Can be Used in Model-Free/Monte Carlo Contexts)

[Figure: starting from an initial policy, cycle between policy Q-factor evaluation (evaluate the Q-factor Q_µ of the current policy µ) and policy improvement (generate an improved policy µ̄).]

Given the typical policy µ^k:
Policy evaluation computes Q_{µ^k}(i, u), for all i = 1, ..., n and u ∈ U(i), as the solution of the (linear) system of equations
    Q_{µ^k}(i, u) = Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + α Q_{µ^k}(j, µ^k(j)) ).
Policy improvement then computes a new policy µ^{k+1} as
    µ^{k+1}(i) ∈ arg min_{u ∈ U(i)} Q_{µ^k}(i, u),   i = 1, ..., n.
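
A minimal model-based sketch of this Q-factor version under the hypothetical P/G/alpha layout: the linear system is solved by noting Q_µ(j, µ(j)) = J_µ(j), so J_µ is obtained first and the remaining Q-factors follow by one expectation. In a model-free context Q_µ would instead be estimated by Monte Carlo simulation; note that the improvement step uses only Q-factors and needs no transition model.

```python
import numpy as np

def evaluate_q_factors(P, G, alpha, mu):
    """Solve Q_mu(i,u) = sum_j p_ij(u) * (g(i,u,j) + alpha * Q_mu(j, mu(j))) exactly."""
    n = P.shape[0]
    idx = np.arange(n)
    P_mu = P[idx, mu, :]
    g_mu = np.einsum('ij,ij->i', P_mu, G[idx, mu, :])
    J_mu = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)    # J_mu(j) = Q_mu(j, mu(j))
    Q = np.einsum('iuj,iuj->iu', P, G) + alpha * P @ J_mu      # all Q-factors of mu
    return Q

def q_improvement(Q):
    """Policy improvement from Q-factors: mu_new(i) in argmin_u Q(i, u)."""
    return Q.argmin(axis=1)
```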

A Working Break: Think About Approximate PI

[Figure: the PI cycle of policy cost evaluation (evaluate the cost function J_µ of the current policy µ) and policy improvement (generate an improved policy µ̄).]

How would you introduce approximations into PI? What would make sense for:
Approximation in policy evaluation?
Approximation in policy improvement?
Give examples (problem approximation, rollout, MPC, neural nets, ...).

Approximation in Value Space for Infinite Horizon Problems

One-step lookahead: at the current state i, apply a control attaining the (possibly approximate) minimum
    min_{u ∈ U(i)} Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + α J̃(j) ),
where the first step is treated exactly and the future is summarized by an approximation J̃.
Approximations in the minimization: replace E{·} with nominal values (certainty equivalence), adaptive simulation, Monte Carlo tree search.
Computation of J̃: problem approximation, rollout, approximate PI, parametric approximation, aggregation.

We will focus on rollout, and particularly on approximate PI schemes, which operate as follows:
Several policies µ^0, µ^1, ..., µ^m are generated, starting with an initial policy µ^0.
Each policy µ^k is evaluated approximately, with a cost function J̃_{µ^k}, often with the use of a parametric approximation/neural network approach.
The next policy µ^{k+1} is generated by policy improvement based on J̃_{µ^k}.
The approximate evaluation J̃_{µ^m} of the last policy in the sequence is used as the lookahead approximation J̃ in a one-step or multistep lookahead minimization.
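
A minimal sketch of the one-step lookahead control selection with a generic cost approximation, under the hypothetical P/G/alpha layout; here J_tilde is simply an array of approximate state costs, however it was produced (problem approximation, rollout values, a trained neural network, etc.).

```python
import numpy as np

def one_step_lookahead_control(i, P, G, alpha, J_tilde):
    """At state i, return the control attaining
       min_u sum_j p_ij(u) * (g(i,u,j) + alpha * J_tilde(j))."""
    q = np.einsum('uj,uj->u', P[i], G[i]) + alpha * P[i] @ J_tilde
    return int(np.argmin(q))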

Rollout

The pure form of rollout: approximation in value space with J̃ = J_µ.
µ is called the base policy, and is usually evaluated by Monte Carlo simulation.
The rollout policy is the result of a single policy improvement using µ. So the rollout policy improves over the base policy.

[Figure: lookahead tree from the current state i_k over states i_{k+1}, i_{k+2}, with selective-depth rollout using policy µ and a terminal cost approximation.]

Variants of rollout (ℓ-step lookahead, truncated rollout, terminal cost approximation): ℓ-step lookahead, then rollout with policy µ for a limited number of steps, and finally a terminal cost approximation. This is a single optimistic policy iteration combined with multistep lookahead.
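
A minimal sketch of pure rollout at a single state, under the hypothetical P/G/alpha layout: the base-policy costs J_µ(j) are estimated by Monte Carlo over truncated trajectories (horizon and num_sims are arbitrary illustration values), and the rollout control is then the one-step lookahead minimizer with J̃ = J_µ.

```python
import numpy as np

def rollout_control(i, P, G, alpha, base_policy, horizon=200, num_sims=100, rng=None):
    """Pure rollout at state i: one-step lookahead with J_tilde = J_mu,
    where J_mu is estimated by Monte Carlo simulation of the base policy."""
    rng = np.random.default_rng() if rng is None else rng
    n = P.shape[0]

    def mc_cost(j):
        # Average discounted cost of the base policy over truncated trajectories from j.
        total = 0.0
        for _ in range(num_sims):
            s, discount = j, 1.0
            for _ in range(horizon):
                u = base_policy[s]
                s_next = rng.choice(n, p=P[s, u])
                total += discount * G[s, u, s_next]
                discount *= alpha
                s = s_next
        return total / num_sims

    J_mu = np.array([mc_cost(j) for j in range(n)])            # base-policy cost estimates
    q = np.einsum('uj,uj->u', P[i], G[i]) + alpha * P[i] @ J_mu
    return int(np.argmin(q))                                    # the rollout control at state i
```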

Approximate (Nonoptimistic) Policy Iteration - Error Bound - Stability

[Figure: as the PI index k increases, the costs J_{µ^k} oscillate within an error zone above J* of width (ε + 2αδ)/(1 − α)².]

Assume an approximate policy evaluation error satisfying
    max_{i=1,...,n} | J̃_{µ^k}(i) − J_{µ^k}(i) | ≤ δ,
and an approximate policy improvement error satisfying
    max_{i=1,...,n} | Σ_{j=1}^n p_{ij}(µ^{k+1}(i)) ( g(i, µ^{k+1}(i), j) + α J̃_{µ^k}(j) ) − min_{u ∈ U(i)} Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + α J̃_{µ^k}(j) ) | ≤ ε.
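
For reference, a worked statement of the error bound depicted in the figure (the standard asymptotic bound for approximate PI in discounted problems, under the two assumptions above):

```latex
\limsup_{k \to \infty}\; \max_{i=1,\ldots,n} \bigl( J_{\mu^k}(i) - J^*(i) \bigr)
\;\le\; \frac{\epsilon + 2\alpha\delta}{(1-\alpha)^2}
```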

Error Bound for the Case Where Policies Converge

[Figure: when the policy sequence converges, J_{µ^k} settles within an error zone above J* of width (ε + 2αδ)/(1 − α).]

A better error bound (by a factor 1 − α) holds if the generated policy sequence {µ^k} converges to some policy.
Convergence of policies is guaranteed in some cases; approximate PI using aggregation is one of them.

About the Next Lecture

We will cover:
PI with parametric approximation methods
Linear programming approach
Q-learning
Additional methods; temporal differences

PLEASE READ AS MUCH OF THE SECTIONS AS YOU CAN
PLEASE DOWNLOAD THE LATEST VERSIONS FROM MY WEBSITE
