Chapter 13 Wow! Least Squares Methods in Batch RL
1 Chapter 13 Wow! Least Squares Methods in Batch RL

Objectives of this chapter:
- Introduce batch RL
- Tradeoffs: least squares vs. gradient methods
- Evaluating policies: fitted value iteration, Bellman residual minimization, least-squares temporal difference learning
- Learning control: fitted Q-iteration, policy iteration
2 Batch RL

Goal: Given a trajectory of the behavior policy π_b,
  X_1, A_1, R_1, ..., X_t, A_t, R_t, ..., X_N,
compute a good policy!

Batch learning. Properties:
- Data collection is not influenced by the learner
- Emphasis is on the quality of the solution
- Computational complexity plays a secondary role

Performance measures:
- ||V* − V^π||_∞ = sup_x |V*(x) − V^π(x)|
- ||V* − V^π||²_{2,µ} = ∫ (V*(x) − V^π(x))² dµ(x)

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction ++ 2
3 Solution methods

- Build a model
- Do not build a model, but find an approximation to Q*:
  - using value iteration => fitted Q-iteration
  - using policy iteration, where the policy is evaluated by:
    - approximate value iteration
    - Bellman-residual minimization (BRM)
    - least-squares temporal difference learning (LSTD) => LSPI
- Policy search
4 Evaluating a policy: Fitted value iteration

Choose a function space F. For i = 1, 2, ..., M, solve the LS (regression) problems:

  Q_{i+1} = argmin_{Q ∈ F} Σ_{t=1}^{T} (R_t + γ Q_i(X_{t+1}, π(X_{t+1})) − Q(X_t, A_t))²

Wait, what about the counterexample of Tsitsiklis and Van Roy? Or the counterexample of Baird? When does this work??

Requirement: if M is big enough and the number of samples is big enough, Q_M should be close to Q^π.
We have to make some assumptions on F...
5 Least-squares vs. gradient

Linear least squares (ordinary regression):
  y_t = w*^T x_t + ε_t, with (x_t, y_t) jointly distributed r.v.s., i.i.d., E[ε_t | x_t] = 0.
Having seen (x_t, y_t), t = 1, ..., T, find w*.

Loss function: L(w) = E[(y_1 − w^T x_1)²].

Least-squares approach:
  w_T = argmin_w Σ_{t=1}^{T} (y_t − w^T x_t)²

Stochastic gradient method:
  w_{t+1} = w_t + α_t (y_t − w_t^T x_t) x_t

Tradeoffs:
- Sample complexity: how good is the estimate?
- Computational complexity: how expensive is the computation?
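The tradeoff above can be seen concretely in a small experiment. Everything below is an illustrative assumption (the synthetic data, the target w*, and a small constant step size used instead of a decaying α_t for simplicity):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic regression data y_t = w*.x_t + noise; w* is the unknown target.
w_star = np.array([2.0, -1.0])
X = rng.normal(size=(500, 2))
y = X @ w_star + 0.1 * rng.normal(size=500)

# Least squares: solve the whole problem at once -- O(T d^2) work,
# statistically the best use of the sample.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Stochastic gradient: one cheap O(d) update per sample; a small constant
# step size stands in for the decaying alpha_t of the slide.
alpha = 0.05
w_sgd = np.zeros(2)
for _ in range(10):                      # a few passes over the batch
    for x_t, y_t in zip(X, y):
        w_sgd += alpha * (y_t - w_sgd @ x_t) * x_t

print(w_ls, w_sgd)                       # both end up near w_star
```

The least-squares solution extracts more from the same sample at a higher per-solve cost; the gradient iterate is cheap per step but noisier.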
6 Fitted value iteration: Analysis

Goal: bound ||Q_M − Q^π||_{2,µ} in terms of max_m ||ε_m||_ν, where
  ||ε_m||²_ν = ∫ ε_m²(x, a) ν(dx, da)  and  Q_{m+1} = T^π Q_m + ε_m.

Let U_m = Q_m − Q^π. Then
  U_{m+1} = Q_{m+1} − Q^π = T^π Q_m − Q^π + ε_m = T^π Q_m − T^π Q^π + ε_m = γ P^π U_m + ε_m.

Unrolling the recursion (with the convention ε_{−1} = U_0):
  U_M = Σ_{m=0}^{M} (γ P^π)^{M−m} ε_{m−1}.
7 Analysis/2

Starting from U_M = Σ_{m=0}^{M} (γ P^π)^{M−m} ε_{m−1}, apply Jensen's inequality (to the operators), together with the assumptions µ ≤ C_1 ν and ∀ρ: ρ P^π ≤ C_1 ν:

  ||U_M||²_{2,µ} = µ U_M²
    ≤ ((1 − γ^{M+1})/(1 − γ))² Σ_{m=0}^{M} (γ^m (1 − γ)/(1 − γ^{M+1})) µ((P^π)^m ε_{M−m−1})²
    ≤ ((1 − γ^{M+1})/(1 − γ)) C_1 Σ_{m=0}^{M} γ^m ν ε_{M−m−1}²
    ≤ C_1 (1/(1 − γ))² (max_m ||ε_m||²_ν + γ^M ν ε_{−1}²).

Legend: ρf = ∫ f(x) ρ(dx), (Pf)(x) = ∫ f(y) P(dy|x).
8 Summary

If the regression errors are all small and the system is "noisy" (∀π, ρ: ρ P^π ≤ C_1 ν), then the final error will be small.

How to make the regression errors small? Regression error decomposition:

  ||Q_{m+1} − T^π Q_m||²  ≤  ||Q_{m+1} − Π_F T^π Q_m||²  +  ||Π_F T^π Q_m − T^π Q_m||²
                                 (estimation error)            (approximation error)
9 Controlling the approximation error

(Figure: the function space F, its image TF, and the distance from Tf back to F.)

10 Controlling the approximation error

(Figure: the distance d_{p,µ}(TF, F) between TF and F.)

11 Controlling the approximation error

(Figure: enlarging F brings it closer to TF, but TF grows as well.)
12 Controlling the approximation error

Assume smoothness! The Bellman operator maps bounded functions into smooth ones:

  T(B(X, R_max/(1−γ))) ⊆ Lip^α(L) ∩ B(X, R_max/(1−γ))
13 Learning with (lots of) historical data

Data: a long trajectory of some exploration policy.
Goal: an efficient algorithm to learn a policy.
Idea: use fitted action-values.

Algorithms:
- Bellman residual minimization, FQI [Antos et al. 06]
- LSPI [Lagoudakis, Parr 03]

Bounds: oracle inequalities (BRM, FQI and LSPI), consistency.
14 BRM insight

TD error: δ_t = R_t + γ Q(X_{t+1}, π(X_{t+1})) − Q(X_t, A_t)
Bellman error: E[E[δ_t | X_t, A_t]²]
What we can compute/estimate: E[E[δ_t² | X_t, A_t]]
They are different! However:

  E[δ_t | X_t, A_t]² = E[δ_t² | X_t, A_t] − Var[δ_t | X_t, A_t]
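The gap between the two quantities is easy to see numerically. The toy setup below (one fixed state-action pair, two possible next states, made-up Q values) is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# One fixed (x, a) pair; the next state X' is random, which is exactly what
# makes the naive squared-TD-error objective biased.
gamma, r = 0.9, 1.0
q_next = np.array([0.0, 2.0])       # Q(x', pi(x')) for the two possible next states
q_xa = 1.5                          # current estimate Q(x, a)

xn = rng.integers(0, 2, size=100_000)          # X' ~ uniform over the two states
delta = r + gamma * q_next[xn] - q_xa          # TD errors

bellman_err_sq = np.mean(delta) ** 2           # (E[delta | x, a])^2  -- what BRM wants
naive_obj = np.mean(delta ** 2)                # E[delta^2 | x, a]    -- what we can estimate
# The identity on the slide: E[delta]^2 = E[delta^2] - Var[delta]
assert abs(bellman_err_sq - (naive_obj - np.var(delta))) < 1e-9
print(bellman_err_sq, naive_obj)               # differ by Var[delta] > 0
```

Minimizing the naive objective also penalizes the next-state variance, which is why BRM needs the correction of the next slide.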
15 Loss function

  L_{N,π}(Q, h) = (1/N) Σ_{t=1}^{N} w_t { (R_t + γ Q(X_{t+1}, π(X_{t+1})) − Q(X_t, A_t))²
                                        − (R_t + γ Q(X_{t+1}, π(X_{t+1})) − h(X_t, A_t))² },

  where w_t = 1/µ(A_t | X_t).
16 Algorithm (BRM++)

1. Choose π_0, i := 0
2. While (i ≤ K) do:
3.   Q_{i+1} = argmin_{Q ∈ F^A} sup_{h ∈ F^A} L_{N,π_i}(Q, h)
4.   π_{i+1}(x) = argmax_{a ∈ A} Q_{i+1}(x, a)
5.   i := i + 1
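The loop above can be sketched as follows. This is a simplified stand-in, not the full algorithm: the min-max over (Q, h) in step 3 is replaced by a plain fitted least-squares evaluation, and tabular one-hot features are used so the regression is exact and the iteration well behaved; all sizes and the random batch are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Skeleton of the slide's loop: choose pi_0, then alternate policy evaluation
# (step 3, simplified) and greedy improvement (step 4) for K rounds.
n_states, n_actions, gamma, K = 6, 2, 0.9, 5
phi = np.eye(n_states * n_actions).reshape(n_states, n_actions, -1)  # one-hot features

def evaluate(batch, pi, n_iters=30):
    # Iterate Q <- least-squares fit of R_t + gamma * Q(X_{t+1}, pi(X_{t+1})).
    w = np.zeros(phi.shape[-1])
    X = np.array([phi[x, a] for x, a, _, _ in batch])
    for _ in range(n_iters):
        y = np.array([r + gamma * phi[xn, pi[xn]] @ w for _, _, r, xn in batch])
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def greedy(w):
    return np.argmax(phi @ w, axis=1)        # step 4: pi(x) = argmax_a Q(x, a)

# A random single-trajectory-style batch of (X_t, A_t, R_t, X_{t+1}) tuples.
batch = [(int(rng.integers(n_states)), int(rng.integers(n_actions)),
          float(rng.normal()), int(rng.integers(n_states))) for _ in range(300)]

pi = np.zeros(n_states, dtype=int)           # step 1: pi_0
for _ in range(K):                           # steps 2-5
    w = evaluate(batch, pi)
    pi = greedy(w)
```

With a non-tabular function class F, step 3 would use the modified loss L_{N,π_i}(Q, h) of the previous slide instead of the plain regression.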
17 Do we need to reweight or throw away data?

NO! Why? Intuition from regression: m(x) = E[Y | X = x] can be learnt no matter what p(x) is! Learning π*(a|x) should be possible in the same way.
BUT... performance might be poor! => YES! Just like in supervised learning when the training and test distributions are different.
18 Bound

  ||Q* − Q^{π_K}||_{2,ρ} ≤ (2γ/(1−γ)²) C^{1/2}_{ρ,ν} (Ẽ(F) + E(F) + S_{N,x})^{1/2} + (2γ^K)^{1/2} R_max,

where S_{N,x} is the estimation-error term, of order

  S_{N,x} ∝ ( ((V² + 1) ln N + ln c_1 + x) / (b^{1/κ} N) )^{1/2}

up to the constants c_2, κ (V is the "dimension" of the function class F).
19 The concentration coefficients

Lyapunov exponents. Our case:

  y_{t+1} = P_t y_t,    γ̂_top = limsup_{t→∞} (1/t) log⁺(||y_t||)

- y_t is infinite dimensional
- P_t depends on the policy chosen
- If the top Lyapunov exponent is ≤ 0, we are good
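The quantity γ̂_top can be estimated by running the product and tracking log-growth. The toy 2×2 row-stochastic matrices below are a stand-in for the (infinite-dimensional, policy-dependent) P_t of the slide:

```python
import numpy as np

rng = np.random.default_rng(4)

# Estimate gamma_top = limsup (1/t) log ||y_t||  for  y_{t+1} = P_t y_t,
# where P_t is drawn at random from two fixed row-stochastic matrices.
mats = [rng.uniform(size=(2, 2)) for _ in range(2)]
mats = [m / m.sum(axis=1, keepdims=True) for m in mats]   # make rows sum to 1

y = rng.normal(size=2)
T = 5000
log_norm = 0.0
for _ in range(T):
    y = mats[rng.integers(2)] @ y
    n = np.linalg.norm(y)
    log_norm += np.log(n)      # accumulate the log growth...
    y /= n                     # ...and renormalize to avoid under/overflow
lyap = log_norm / T
print(lyap)                    # near 0: products of stochastic matrices stay bounded
```

A nonpositive estimate corresponds to the "good" case on the slide, where the products stay bounded.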
20 Open question

Abstraction: f(i_1, ..., i_m) = log(||P_{i_1} P_{i_2} ... P_{i_m}||), i_k ∈ {0, 1}.

Let f : {0,1}* → R_+ be subadditive, f(xy) ≤ f(x) + f(y), and suppose that for every infinite sequence x,

  limsup_m (1/m) f([x]_m) ≤ β,

where [x]_m denotes the length-m prefix of x. True?

  For every {y_m}_m with y_m ∈ {0,1}^m:  limsup_m (1/m) f(y_m) ≤ β
21 Relation to LSTD

LSTD:
- Linear function space
- "Bootstrap" the normal equations
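Bootstrapping the normal equations means solving them in one shot rather than iterating a regression. A minimal sketch for policy evaluation with linear features; the chain MDP, feature map, and batch are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

# LSTD with a linear function space V(x) = phi(x).w: solve the bootstrapped
# normal equations  A w = b,  where
#   A = sum_t phi_t (phi_t - gamma * phi_{t+1})^T,   b = sum_t phi_t R_t.
gamma, d = 0.9, 3
phi = rng.normal(size=(10, d))               # one feature vector per state

# A batch of (X_t, R_t, X_{t+1}) triples from some behavior policy on a
# 10-state chain (all illustrative).
states = rng.integers(10, size=500)
rewards = rng.normal(size=500)
next_states = np.minimum(states + 1, 9)

A = np.zeros((d, d))
b = np.zeros(d)
for x, r, xn in zip(states, rewards, next_states):
    A += np.outer(phi[x], phi[x] - gamma * phi[xn])
    b += phi[x] * r
w = np.linalg.solve(A, b)                    # the LSTD solution
```

Compared with fitted value iteration, which repeats a regression until it converges, LSTD finds the fixed point of the projected Bellman equation directly from one linear solve.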
22 Conclusions

- Fitted RL algorithms work (for smooth MDPs), even with a single trajectory
- What to do about the curse of dimensionality? Need adaptive algorithms that can take advantage of regularity when present:
  - Penalized least-squares / aggregation?
  - Feature relevance
  - Factorization
  - Manifold estimation
  - Abstraction
  - Special purpose algorithms?
- What priors to assume???
- Is on-line learning easier?
23 Reading/References

- M. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4, 2003.
- A. Antos, Cs. Szepesvári, R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning Journal (MLJ), to appear, 2007.
ECE 5424: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting PAC Learning Readings: Murphy 16.4;; Hastie 16 Stefan Lee Virginia Tech Fighting the bias-variance tradeoff Simple
More informationIntroduction to Reinforcement Learning Part 1: Markov Decision Processes
Introduction to Reinforcement Learning Part 1: Markov Decision Processes Rowan McAllister Reinforcement Learning Reading Group 8 April 2015 Note I ve created these slides whilst following Algorithms for
More informationStochastic Primal-Dual Methods for Reinforcement Learning
Stochastic Primal-Dual Methods for Reinforcement Learning Alireza Askarian 1 Amber Srivastava 1 1 Department of Mechanical Engineering University of Illinois at Urbana Champaign Big Data Optimization,
More informationMarkov Decision Processes and Dynamic Programming
Markov Decision Processes and Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) ENS Cachan - Master 2 MVA SequeL INRIA Lille MVA-RL Course How to model an RL problem The Markov Decision Process
More informationReinforcement Learning: the basics
Reinforcement Learning: the basics Olivier Sigaud Université Pierre et Marie Curie, PARIS 6 http://people.isir.upmc.fr/sigaud August 6, 2012 1 / 46 Introduction Action selection/planning Learning by trial-and-error
More informationMidterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric
More informationOnline least-squares policy iteration for reinforcement learning control
2 American Control Conference Marriott Waterfront, Baltimore, MD, USA June 3-July 2, 2 WeA4.2 Online least-squares policy iteration for reinforcement learning control Lucian Buşoniu, Damien Ernst, Bart
More information