Chapter 13: Least Squares Methods in Batch RL

1 Chapter 13: Least Squares Methods in Batch RL
Objectives of this chapter:
- Introduce batch RL
- Tradeoffs: least squares vs. gradient methods
- Evaluating policies: fitted value iteration, Bellman residual minimization, least-squares temporal difference learning
- Learning control: fitted Q-iteration, policy iteration

2 Batch RL
Goal: given a trajectory of the behavior policy π_b,
  X_1, A_1, R_1, ..., X_t, A_t, R_t, ..., X_N,
compute a good policy!
Batch learning properties:
- Data collection is not influenced by the learner
- Emphasis is on the quality of the solution
- Computational complexity plays a secondary role
Performance measures:
$$\|V^* - V^\pi\|_\infty = \sup_x |V^*(x) - V^\pi(x)|, \qquad \|V^* - V^\pi\|_{2,\mu}^2 = \int \big(V^*(x) - V^\pi(x)\big)^2\,\mu(dx).$$

3 Solution methods
- Build a model
- Do not build a model, but find an approximation to Q*:
  - using value iteration => fitted Q-iteration
  - using policy iteration, with the policy evaluated by
    - approximate value iteration,
    - Bellman-residual minimization (BRM), or
    - least-squares temporal difference learning (LSTD) => LSPI
- Policy search

4 Evaluating a policy: fitted value iteration
Choose a function space F. For i = 1, 2, ..., M solve the least-squares (regression) problems
$$Q_{i+1} = \arg\min_{Q \in \mathcal F} \sum_{t=1}^{T} \big(R_t + \gamma Q_i(X_{t+1}, \pi(X_{t+1})) - Q(X_t, A_t)\big)^2 .$$
Wait, what about the counterexample of Tsitsiklis and Van Roy? Or the counterexample of Baird? When does this work?
Requirement: if M is big enough and the number of samples is big enough, Q_M should be close to Q^π.
We have to make some assumptions on F...
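To make the regression step concrete, here is a minimal sketch of one fitted policy-evaluation iteration for a linear function space; the feature map phi, the transition list data, and the policy pi are illustrative placeholders, not part of the original slides.

```python
import numpy as np

def fitted_policy_eval_step(phi, data, w_prev, pi, gamma):
    """One regression step of fitted policy evaluation (a sketch).

    phi(x, a) -> feature vector; data is a list of (x, a, r, x_next)
    transitions; w_prev parameterizes the previous iterate
    Q_i(x, a) = phi(x, a) @ w_prev.
    """
    X = np.array([phi(x, a) for (x, a, r, xn) in data])
    # Regression targets: R_t + gamma * Q_i(X_{t+1}, pi(X_{t+1}))
    y = np.array([r + gamma * phi(xn, pi(xn)) @ w_prev
                  for (x, a, r, xn) in data])
    # Ordinary least squares: Q_{i+1} = argmin_Q sum_t (y_t - Q(X_t, A_t))^2
    w_next, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_next
```

Iterating this step M times on the same batch gives the sequence Q_1, ..., Q_M of the slide.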

5 Least-squares vs. gradient
Linear least squares (ordinary regression):
$$y_t = w_*^\top x_t + \epsilon_t,$$
where the $(x_t, y_t)$ are jointly distributed, i.i.d. random variables with $E[\epsilon_t \mid x_t] = 0$.
Seeing $(x_t, y_t)$, $t = 1, \ldots, T$, find out $w_*$.
Loss function: $L(w) = E\big[(y_1 - w^\top x_1)^2\big]$.
Least-squares approach: $w_T = \arg\min_w \sum_{t=1}^{T} (y_t - w^\top x_t)^2$.
Stochastic gradient method: $w_{t+1} = w_t + \alpha_t (y_t - w_t^\top x_t)\, x_t$.
Tradeoffs:
- Sample complexity: how good is the estimate?
- Computational complexity: how expensive is the computation?
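A toy comparison of the two estimators on synthetic data (a sketch; the data-generating parameters and the decaying step-size schedule are arbitrary choices): the batch least-squares solve touches all T samples at once, while the stochastic gradient rule performs one cheap O(d) update per sample.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 1000, 5
w_star = rng.normal(size=d)
X = rng.normal(size=(T, d))
y = X @ w_star + 0.1 * rng.normal(size=T)        # y_t = w*^T x_t + eps_t

# Least squares: solve the normal equations once (O(T d^2 + d^3) work).
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Stochastic gradient: one O(d) update per sample.
w_sgd = np.zeros(d)
for t in range(T):
    alpha = 1.0 / (t + 10)                        # decaying step size
    w_sgd += alpha * (y[t] - w_sgd @ X[t]) * X[t]

print(np.linalg.norm(w_ls - w_star), np.linalg.norm(w_sgd - w_star))
```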

6 Fitted value iteration: analysis
Goal: bound $\|Q_M - Q^\pi\|_{2,\mu}$ in terms of $\max_m \|\epsilon_m\|_\nu^2$, where $\|\epsilon_m\|_\nu^2 = \int \epsilon_m^2(x,a)\,\nu(dx,da)$ and $Q_{m+1} = T^\pi Q_m + \epsilon_m$.
Let $U_m = Q_m - Q^\pi$. Then
$$U_{m+1} = Q_{m+1} - Q^\pi = T^\pi Q_m - Q^\pi + \epsilon_m = T^\pi Q_m - T^\pi Q^\pi + \epsilon_m = \gamma P^\pi U_m + \epsilon_m,$$
and unrolling the recursion,
$$U_M = \sum_{m=0}^{M} (\gamma P^\pi)^{M-m}\, \epsilon_{m-1}, \qquad \epsilon_{-1} := U_0 = Q_0 - Q^\pi .$$
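The recursion can be checked numerically on a small finite problem; the sketch below assumes U_0 = 0 (i.e. Q_0 = Q^π) and uses a random row-stochastic matrix as a stand-in for P^π.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, gamma = 4, 6, 0.9
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # P^pi (row-stochastic)
eps = [rng.normal(scale=0.1, size=n) for _ in range(M)]     # regression errors

# Unroll U_{m+1} = gamma P U_m + eps_m starting from U_0 = 0 ...
U = np.zeros(n)
for m in range(M):
    U = gamma * P @ U + eps[m]

# ... and compare with the closed-form sum U_M = sum_m (gamma P)^{M-1-m} eps_m.
U_closed = sum(np.linalg.matrix_power(gamma * P, M - 1 - m) @ eps[m]
               for m in range(M))
print(np.allclose(U, U_closed))   # True
```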

7 Analysis, continued
$$\mu U_M^2 \;\overset{\text{Jensen}}{\le}\; \frac{1-\gamma^{M+1}}{1-\gamma} \sum_{m=0}^{M} \gamma^m\, \mu\big((P^\pi)^m \epsilon_{M-m-1}^2\big)
\;\le\; \frac{1-\gamma^{M+1}}{1-\gamma}\, C_1 \sum_{m=0}^{M} \gamma^m\, \nu\, \epsilon_{M-m-1}^2
\;\le\; C_1 \Big(\frac{1}{1-\gamma}\Big)^2 \bar\epsilon^{\,2} \;+\; \frac{C_1\, \gamma^M}{1-\gamma}\, \nu\, \epsilon_{-1}^2,$$
where $\bar\epsilon^{\,2} = \max_{0 \le m < M} \nu\, \epsilon_m^2$.
Jensen's inequality is applied to the operators, and we assume $\mu \le C_1 \nu$ and, for every distribution $\rho$, $\rho P^\pi \le C_1 \nu$.
Legend: $\rho f = \int f(x)\, \rho(dx)$, $\; (P f)(x) = \int f(y)\, P(dy \mid x)$.

8 Summary
If the regression errors are all small and the system is noisy (for all π and ρ, ρ P^π ≤ C_1 ν), then the final error will be small.
How do we make the regression errors small? Regression error decomposition:
$$\|Q_{m+1} - T^\pi Q_m\|_2 \;\le\; \underbrace{\|Q_{m+1} - \Pi_{\mathcal F} T^\pi Q_m\|_2}_{\text{estimation error}} \;+\; \underbrace{\|\Pi_{\mathcal F} T^\pi Q_m - T^\pi Q_m\|_2}_{\text{approximation error}}$$

9 Controlling the approximation error
[Figure: a function f in the space F and its image Tf, which may fall outside F.]

10 Controlling the approximation error
[Figure: the distance $d_{p,\mu}(T\mathcal F, \mathcal F)$ of the image TF from the space F.]

11 Controlling the approximation error
[Figure: function spaces F and their images TF.]

12 Controlling the approximation error
Assume smoothness! For smooth MDPs the Bellman operator maps bounded functions into Lipschitz functions:
$$T\Big(\mathcal B\big(\mathcal X, \tfrac{R_{\max}}{1-\gamma}\big)\Big) \subseteq \mathrm{Lip}_\alpha(L) \cap \mathcal B\big(\mathcal X, \tfrac{R_{\max}}{1-\gamma}\big).$$

13 Learning with (lots of) historical data
Data: a long trajectory of some exploration policy.
Goal: an efficient algorithm that learns a policy.
Idea: use fitted action-values.
Algorithms: Bellman residual minimization, FQI [Antos et al. 06]; LSPI [Lagoudakis & Parr 03].
Bounds: oracle inequalities (for BRM, FQI and LSPI) => consistency.

14 BRM insight
TD error: $\delta_t = R_t + \gamma Q(X_{t+1}, \pi(X_{t+1})) - Q(X_t, A_t)$.
Bellman error: $E\big[\,E[\delta_t \mid X_t, A_t]^2\,\big]$.
What we can compute/estimate: $E\big[\,E[\delta_t^2 \mid X_t, A_t]\,\big]$.
They are different! However,
$$E[\delta_t \mid X_t, A_t]^2 = E[\delta_t^2 \mid X_t, A_t] - \mathrm{Var}[\delta_t \mid X_t, A_t].$$
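A quick numerical illustration of this identity (a sketch with made-up numbers): at a fixed (x, a) the TD error δ is random because of the transition noise, so the naive average of δ² overshoots the squared Bellman error by exactly the conditional variance of δ.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulate delta at one fixed (x, a): mean 0.3 (the true Bellman residual),
# standard deviation 0.5 (transition noise).
delta = 0.3 + rng.normal(scale=0.5, size=100_000)

bellman_error_sq = np.mean(delta) ** 2        # ~ 0.09
naive_estimate = np.mean(delta ** 2)          # ~ 0.09 + 0.25 (biased upward by Var)
print(bellman_error_sq, naive_estimate, naive_estimate - np.var(delta))
```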

15 Loss function
$$L_{N,\pi}(Q, h) = \frac{1}{N} \sum_{t=1}^{N} w_t \Big[ \big(R_t + \gamma Q(X_{t+1}, \pi(X_{t+1})) - Q(X_t, A_t)\big)^2 - \big(R_t + \gamma Q(X_{t+1}, \pi(X_{t+1})) - h(X_t, A_t)\big)^2 \Big],$$
with importance weights $w_t = 1/\mu(A_t \mid X_t)$, the inverse probability of the action under the data-collection policy.
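A direct sample-based implementation of this loss might look as follows; it is a sketch assuming the weight multiplies the whole bracketed difference, and q, h, pi, mu, and data are placeholder names for the candidate action-value functions, the evaluated policy, the behavior policy's action probabilities, and the transition set.

```python
def brm_loss(q, h, data, pi, gamma, mu):
    """Sample-based modified Bellman-residual loss (a sketch).

    q(x, a), h(x, a): candidate action-value functions; pi(x): the policy
    being evaluated; mu(a, x): probability of action a at x under the
    data-collection policy, giving the weight w_t = 1 / mu(a, x).
    """
    total = 0.0
    for (x, a, r, xn) in data:
        w = 1.0 / mu(a, x)
        target = r + gamma * q(xn, pi(xn))
        total += w * ((target - q(x, a)) ** 2 - (target - h(x, a)) ** 2)
    return total / len(data)
```

Subtracting the h-term removes (an estimate of) the conditional variance of the TD error, which is what makes the objective track the Bellman error rather than its biased square.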

16 Algorithm (BRM++)
1. Choose π_0, set i := 0.
2. While i ≤ K do:
3.   Let $Q_{i+1} = \arg\min_{Q \in \mathcal F^A} \sup_{h \in \mathcal F^A} L_{N,\pi_i}(Q, h)$.
4.   Let $\pi_{i+1}(x) = \arg\max_{a \in A} Q_{i+1}(x, a)$.
5.   i := i + 1.

17 Do we need to reweight or throw away data?
NO! WHY? Intuition from regression: m(x) = E[Y | X = x] can be learned no matter what p(x) is. The same should be possible for π*(a | x)!
BUT... performance might be poor! => YES!
Just like in supervised learning when the training and test distributions are different.

18 Bound
$$\|Q^* - Q^{\pi_K}\|_{2,\rho} \;\le\; \frac{2\gamma}{(1-\gamma)^2}\, C_{\rho,\nu}^{1/2} \Big(\tilde E(\mathcal F) + E(\mathcal F) + S_{N,x}\Big)^{1/2} + (2\gamma^K)^{1/2} R_{\max},$$
where $\tilde E(\mathcal F)$ and $E(\mathcal F)$ are approximation-error terms and $S_{N,x}$ is the estimation-error term: it is polynomial in $(V^2+1)\ln N + \ln c_1 + \tfrac{1}{1+\kappa}\ln(b c_2^2/4) + x$ and decays roughly as $(b^{1/\kappa} N)^{-1/2}$ (constants $c_1, c_2, b, \kappa, V$ as in Antos et al.).

19 The concentration coefficients
The concentration coefficients behave like Lyapunov exponents. In our case
$$y_{t+1} = P_t\, y_t, \qquad \hat\gamma_{\mathrm{top}} = \limsup_{t\to\infty} \frac{1}{t} \log^+ \|y_t\|,$$
where y_t is infinite dimensional and P_t depends on the policy chosen.
If the top Lyapunov exponent is <= 0, we are good.
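The top Lyapunov exponent of such a product of operators can be estimated by iterating with periodic renormalization; the sketch below uses small random matrices as stand-ins for the (infinite-dimensional, policy-dependent) P_t.

```python
import numpy as np

rng = np.random.default_rng(0)
P = [0.3 * rng.random((5, 5)) for _ in range(2)]  # stand-ins for the operators P_t

y = rng.random(5)
log_growth, T = 0.0, 10_000
for t in range(T):
    y = P[rng.integers(len(P))] @ y               # y_{t+1} = P_t y_t
    norm = np.linalg.norm(y)
    log_growth += np.log(norm)
    y /= norm                                      # renormalize to avoid under/overflow
print(log_growth / T)                              # estimate of the top Lyapunov exponent
```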

20 Open question
Abstraction: $f(i_1, \ldots, i_m) = \log\big(\|P_{i_1} P_{i_2} \cdots P_{i_m}\|\big)$, $\; i_k \in \{0,1\}$.
Let $f : \{0,1\}^* \to \mathbb R_+$ satisfy $f(x+y) \le f(x) + f(y)$ and $\limsup_m \frac{1}{m} f([x]_m) \le \beta$.
True? For every sequence $\{y_m\}_m$ with $y_m \in \{0,1\}^m$, $\; \limsup_m \frac{1}{m} f(y_m) \le \beta$.

21 Relation to LSTD
LSTD: linear function space; bootstrap the normal equations.
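For reference, here is a compact sketch of LSTD(0) on a fixed batch of transitions, assuming a linear space with feature map phi (all names are placeholders): the bootstrapped normal equations A w = b replace the regression targets by their one-step bootstrapped counterparts.

```python
import numpy as np

def lstd(phi, data, pi, gamma, reg=1e-6):
    """LSTD(0) for evaluating pi from a batch of transitions (a sketch).

    Solves the bootstrapped normal equations  A w = b  with
    A = sum_t phi_t (phi_t - gamma * phi'_t)^T  and  b = sum_t phi_t R_t,
    where phi_t = phi(X_t, A_t) and phi'_t = phi(X_{t+1}, pi(X_{t+1})).
    """
    d = len(phi(*data[0][:2]))
    A = reg * np.eye(d)          # small ridge term for invertibility
    b = np.zeros(d)
    for (x, a, r, xn) in data:
        f, fn = phi(x, a), phi(xn, pi(xn))
        A += np.outer(f, f - gamma * fn)
        b += r * f
    return np.linalg.solve(A, b)
```

Unlike the regression step of fitted value iteration, no residual is minimized directly: the fixed point of the projected Bellman equation is computed in one solve.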

22 Conclusions
Fitted RL algorithms work (for smooth MDPs), even with a single trajectory.
What to do about the curse of dimensionality? We need adaptive algorithms that can take advantage of regularity when it is present:
- Penalized least-squares / aggregation?
- Feature relevance
- Factorization
- Manifold estimation
- Abstraction
- Special-purpose algorithms?
What priors should we assume?
Is on-line learning easier?

23 Reading/References
M. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107-1149, 2003.
A. Antos, Cs. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning Journal (MLJ), to appear, 2007.
