Reinforcement Learning
1 Reinforcement Learning: Function Approximation
Continuous state/action spaces, mean-squared error, gradient temporal difference learning, least-squares temporal difference, least-squares policy iteration.
Vien Ngo, Marc Toussaint, University of Stuttgart
2 Outline
- Function Approximation
- Gradient Descent Methods
- Least-Squares Temporal Difference
3 Value Iteration in Continuous MDPs
$$V(s) = \sup_a \Big[ r(s,a) + \gamma \int P(s' \mid s, a)\, V(s')\, ds' \Big]$$
4 Continuous states/actions in model-free RL
All of this is fine in small finite state and action spaces: $Q(s,a)$ is an $|S| \times |A|$ matrix of numbers, and $\pi(a \mid s)$ is an $|S| \times |A|$ matrix of numbers. In the following, two approaches to handling continuous states/actions:
- use function approximation to estimate $Q(s,a)$: gradient descent (TD with FA), LSPI;
- optimize a parameterized $\pi(a \mid s)$ (policy search, next lecture).
5 Value Function Approximation
Estimate of the value function: $V_t(s) = V(s, \theta_t)$ (from Satinder Singh, RL: A tutorial at videolectures.net)
6 Performance Measure
Minimize the mean-squared error (MSE) over some distribution $P$ of the states:
$$\mathrm{MSE}(\beta_t) = \sum_{s \in S} P(s)\, \big[ V^\pi(s) - V_t(s) \big]^2$$
where $V^\pi(s)$ is the true value function of the policy $\pi$. Set $P$ to the stationary distribution of policy $\pi$ in on-policy learning methods (e.g. SARSA).
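This measure is straightforward to evaluate on a small state set when the true values are known; a minimal sketch (the arrays P, V_pi and V_hat are hypothetical stand-ins for the state distribution, the true values, and the approximation):

```python
import numpy as np

def weighted_mse(P, V_pi, V_hat):
    """Mean-squared value error weighted by a state distribution P."""
    # P[s] weights each state's squared error, e.g. by how often
    # policy pi visits s (the on-policy stationary distribution).
    return np.sum(P * (V_pi - V_hat) ** 2)

# Toy check: three states, uniform state distribution.
P = np.array([1/3, 1/3, 1/3])
V_pi = np.array([1.0, 2.0, 3.0])
V_hat = np.array([0.9, 2.1, 3.3])
print(weighted_mse(P, V_pi, V_hat))  # approximately 0.0367
```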
7 Value Function Approximation
The estimated value function:
$$V(s, \beta_t) = \beta_t^\top \phi(s)$$
where $\beta \in \mathbb{R}^d$ is a vector of parameters and $\phi: S \to \mathbb{R}^d$ is a mapping from states to $d$-dimensional feature vectors. Examples: polynomial, RBF, Fourier, wavelet bases, tile coding (these suffer from the curse of dimensionality). Nonparametric methods: k-nearest neighbor, nonparametric kernel smoothing, spline smoothers, Gaussian process regression, ...
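To make the linear-in-parameters form concrete, a minimal sketch with an RBF feature map (the centers and bandwidth are arbitrary illustrative choices, not from the slides):

```python
import numpy as np

def rbf_features(s, centers, width=0.5):
    """phi : S -> R^d, one Gaussian bump per center."""
    return np.exp(-np.sum((centers - s) ** 2, axis=1) / (2 * width ** 2))

def value(s, beta, centers):
    """Linear value function approximation: V(s, beta) = beta^T phi(s)."""
    return beta @ rbf_features(s, centers)

# d = 5 RBF centers spread over a 1-D state space [0, 1].
centers = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
beta = np.random.randn(5)
print(value(np.array([0.3]), beta, centers))
```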
8 Value Function Approximation
($10^{35}$ states; $10^5$ binary features and parameters.) (Sutton, presentation at ICML 2009)
9 TD(λ) with Function Approximation
The gradient at any point $\beta_t$:
$$\nabla_\beta \mathrm{MSE}(\beta_t) = -2 \sum_{s \in S} P(s)\big[V^\pi(s) - V_t(s)\big]\, \nabla_\beta V(s, \beta_t) = -2 \sum_{s \in S} P(s)\big[V^\pi(s) - V_t(s)\big]\, \phi(s)$$
Applying stochastic approximation and bootstrapping, we can iteratively update the parameters. TD(0) with function approximation:
$$\beta_{t+1} = \beta_t + \alpha_t \big[ r_t + \gamma V(s', \beta_t) - V(s, \beta_t) \big]\, \phi(s)$$
TD(λ) (with eligibility trace):
$$e_{t+1} = \gamma\lambda\, e_t + \phi(s), \qquad \beta_{t+1} = \beta_t + \alpha_t\, e_{t+1} \big[ r_t + \gamma V(s', \beta_t) - V(s, \beta_t) \big]$$
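The resulting per-transition update is only a few lines; a minimal sketch of one TD(λ) step under the definitions above (all names hypothetical):

```python
import numpy as np

def td_lambda_step(beta, e, phi_s, phi_s_next, r, alpha, gamma, lam):
    """One TD(lambda) update with linear function approximation."""
    e = gamma * lam * e + phi_s                           # eligibility trace
    delta = r + gamma * beta @ phi_s_next - beta @ phi_s  # TD error
    beta = beta + alpha * delta * e                       # parameter update
    return beta, e
```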
10 TD(λ) with Function Approximation (Gradient-descent SARSA(λ))
Repeat (for each episode):
    e = 0, initial state s = s_0
    Repeat (for each step):
        a = π(s)
        Take a, observe r, s'
        e ← γλ e + φ(s, a)
        β ← β + α e [ r + γ Q(s', π(s'), β) − Q(s, a, β) ]
        s ← s'
    until s is terminal.
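Assembled into an episode loop, a sketch of gradient-descent SARSA(λ); it assumes a hypothetical environment whose step(a) returns (s_next, r, done), a feature map phi(s, a), and an ε-greedy policy:

```python
import numpy as np

def sarsa_lambda(env, phi, n_actions, d, episodes=100,
                 alpha=0.05, gamma=0.99, lam=0.8, eps=0.1):
    beta = np.zeros(d)
    q = lambda s, a: beta @ phi(s, a)   # closure sees the latest beta

    def policy(s):
        if np.random.rand() < eps:      # explore
            return np.random.randint(n_actions)
        return int(np.argmax([q(s, a) for a in range(n_actions)]))

    for _ in range(episodes):
        e = np.zeros(d)                 # reset eligibility trace
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = policy(s_next)
            e = gamma * lam * e + phi(s, a)
            target = r if done else r + gamma * q(s_next, a_next)
            beta = beta + alpha * (target - q(s, a)) * e
            s, a = s_next, a_next
    return beta
```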
11 TD(λ) with Function Approximation
Convergence proof: if the stochastic process $s_t$ is an ergodic Markov process whose stationary distribution is the same as the stationary distribution of the underlying MDP (i.e. the on-policy distribution), then TD(λ) converges, with
$$\mathrm{MSE}(\beta_\infty) \le \frac{1 - \gamma\lambda}{1 - \gamma}\, \mathrm{MSE}(\beta^*)$$
where $\beta^*$ is the best parameter vector achievable in the feature space. (Tsitsiklis & Van Roy: An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 1997.)
Is there a convergence guarantee for off-policy methods (e.g. Q-learning with linear function approximation)?
12 Gradient temporal difference learning
- GTD (gradient temporal difference learning)
- GTD2 (gradient temporal difference learning, version 2)
- TDC (temporal difference learning with corrections)
1. Sutton, Szepesvári, Maei: A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. NIPS 2008.
2. Sutton, Maei, Precup, Bhatnagar, Silver, Szepesvári, Wiewiora: Fast gradient-descent methods for temporal-difference learning with linear function approximation. ICML 2009.
13 Value function geometry
Bellman operator: $T^\pi V = R + \gamma P V$. [Figure: value function geometry; the plane is the space spanned by the feature vectors.]
RMSBE: residual mean-squared Bellman error. RMSPBE: residual mean-squared projected Bellman error.
14 TD performance measures
- Error from the true value: $\| V_\beta - V^\pi \|$
- Error in the Bellman update (used in the previous section: gradient-descent methods): $\| V_\beta - T V_\beta \|$
- Error in the Bellman update after projection: $\| V_\beta - \Pi T V_\beta \|$
15 TD performance measures
GTD(0): the norm of the expected TD update,
$$\mathrm{NEU}(\beta) = E[\delta\phi]^\top E[\delta\phi]$$
GTD2 and TDC: the norm of the expected TD update weighted by the inverse covariance matrix of the features ($\delta$ is the TD error),
$$\mathrm{MSPBE}(\beta) = E[\delta\phi]^\top E[\phi\phi^\top]^{-1} E[\delta\phi]$$
(GTD2 and TDC differ slightly in their derivation of the approximate gradient direction.)
16 TD algorithms with linear function approximation
These algorithms:
- are guaranteed convergent under both general on- and off-policy training;
- have computational complexity of only O(n) per update (n is the number of features);
- remove the curse of dimensionality, since the per-update cost does not grow with the number of states.
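A minimal sketch of the TDC update from the second reference above, written from memory under the definitions on the previous slide (the auxiliary vector w tracks $E[\phi\phi^\top]^{-1} E[\delta\phi]$ and uses its own, typically faster, step size):

```python
import numpy as np

def tdc_step(beta, w, phi_s, phi_s_next, r, alpha, zeta, gamma):
    """One TDC (TD with gradient correction) step; O(n) per transition."""
    delta = r + gamma * beta @ phi_s_next - beta @ phi_s
    # Main weights: ordinary TD step plus a correction along the next features.
    beta = beta + alpha * (delta * phi_s - gamma * (w @ phi_s) * phi_s_next)
    # Auxiliary weights: LMS estimate of E[phi phi^T]^{-1} E[delta phi].
    w = w + zeta * (delta - w @ phi_s) * phi_s
    return beta, w
```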
17 LSPI: Least-Squares Policy Iteration
Gradient-descent methods are sensitive to the choice of learning rates and initial parameter values. The least-squares temporal difference (LSTD) method avoids both: LSPI. Two derivations:
- Bellman residual minimization
- least-squares fixed-point approximation
18 Bellman residual minimization
The Q-function for a given policy $\pi$ fulfills, for any $(s, a)$:
$$Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, Q^\pi(s', \pi(s'))$$
If we have $n$ data points $D = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^n$, we require that this equation holds (approximately) on these $n$ data points:
$$\forall i: \quad Q^\pi(s_i, a_i) = r_i + \gamma\, Q^\pi(s'_i, \pi(s'_i))$$
Written in vector notation: $Q = R + \gamma Q'$ with $n$-dimensional data vectors $Q, R, Q'$. Written as an optimization: minimize the Bellman residual error
$$L(Q^\pi) = \| R + \gamma P\Pi Q^\pi - Q^\pi \|^2 \;\text{(true residual)}\; = \sum_{i=1}^n \big[ Q^\pi(s_i, a_i) - r_i - \gamma Q^\pi(s'_i, \pi(s'_i)) \big]^2 = \| R - Q + \gamma Q' \|^2$$
19 Bellman residual minimization
The true fixed point of Bellman residual minimization (this is an overconstrained system):
$$\beta^\pi = \big( (\Phi - \gamma P\Pi\Phi)^\top (\Phi - \gamma P\Pi\Phi) \big)^{-1} (\Phi - \gamma P\Pi\Phi)^\top R$$
The solution $\beta^\pi$ of the system is unique, since the columns of $\Phi$ (the basis functions) are linearly independent by definition. (See Lagoudakis & Parr (JMLR 2003) for details.)
20 LSPI: Least-Squares Fixed-Point Approximation
Project $T^\pi Q$ back onto $\mathrm{span}(\Phi)$:
$$\hat{T}^\pi(Q) = \Phi (\Phi^\top \Phi)^{-1} \Phi^\top (T^\pi Q)$$
The approximate fixed point:
$$\beta^\pi = \big( \Phi^\top (\Phi - \gamma P\Pi\Phi) \big)^{-1} \Phi^\top R$$
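Both closed forms are a few lines of linear algebra once samples are stacked into matrices; a sketch under the (hypothetical) convention that Phi stacks the rows φ(s_i, a_i) and Phi_next stacks φ(s'_i, π(s'_i)):

```python
import numpy as np

def brm_solution(Phi, Phi_next, R, gamma):
    """Bellman residual minimization: least squares on (Phi - gamma*Phi') beta = R."""
    D = Phi - gamma * Phi_next
    return np.linalg.solve(D.T @ D, D.T @ R)

def fixed_point_solution(Phi, Phi_next, R, gamma):
    """Least-squares fixed point: solve Phi^T (Phi - gamma*Phi') beta = Phi^T R."""
    return np.linalg.solve(Phi.T @ (Phi - gamma * Phi_next), Phi.T @ R)
```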
21 LSPI: Comparison of the two views
The Bellman residual minimizing method focuses on the magnitude of the change; the least-squares fixed-point approximation focuses on the direction of the change. The least-squares fixed-point approximation is less stable and less predictable; nevertheless, the least-squares fixed-point method might be preferable, because learning the Bellman residual minimizing approximation requires doubled samples, and experimentally the fixed-point method often delivers policies that are superior. (See Lagoudakis & Parr (JMLR 2003) for details.)
22 LSPI: LSTDQ algorithm
Build $A \approx \Phi^\top (\Phi - \gamma P\Pi\Phi)$ and $b = \Phi^\top R$ incrementally from samples:
A ← 0, b ← 0
For each (s, a, r, s') ∈ D:
    A ← A + φ(s, a) ( φ(s, a) − γ φ(s', π(s')) )^⊤
    b ← b + φ(s, a) r
β ← A^{−1} b
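The same incremental construction as a sketch (the small ridge term is an added safeguard against a singular A, not part of the original algorithm):

```python
import numpy as np

def lstdq(D, phi, policy, d, gamma, ridge=1e-6):
    """LSTDQ: estimate beta for Q^pi from a fixed sample set D."""
    A = ridge * np.eye(d)
    b = np.zeros(d)
    for (s, a, r, s_next) in D:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)
```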
23 LSPI algorithm
given D
repeat:
    π' ← π
    π ← LSTDQ(D, π')    (π is the greedy policy of $\beta^{\pi'}$)
until π ≈ π'
return π
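LSPI thus alternates LSTDQ with greedy policy improvement over the same sample set; a sketch reusing the lstdq function above (stopping when the parameters stabilize is one common convergence test):

```python
import numpy as np

def lspi(D, phi, n_actions, d, gamma, max_iters=20, tol=1e-4):
    beta = np.zeros(d)
    for _ in range(max_iters):
        # Greedy policy with respect to the current Q estimate.
        policy = lambda s, beta=beta: int(
            np.argmax([beta @ phi(s, a) for a in range(n_actions)]))
        beta_new = lstdq(D, phi, policy, d, gamma)
        if np.linalg.norm(beta_new - beta) < tol:  # policy has stabilized
            return beta_new
        beta = beta_new
    return beta
```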
24 LSPI: Riding a bike (from Alma A. M. Rahat's simulation)
States: $(\theta, \dot\theta, \omega, \dot\omega, \ddot\omega, \psi)$, where $\theta$ is the angle of the handlebar, $\omega$ is the vertical angle of the bicycle, and $\psi$ is the angle of the bicycle to the goal.
Actions: $(\tau, \nu)$, where $\tau \in \{-2, 0, 2\}$ is the torque applied to the handlebar and $\nu \in \{-0.02, 0, 0.02\}$ is the displacement of the rider.
For each $a$, the value function $Q(s, a)$ uses 20 features:
$$(1, \omega, \dot\omega, \omega^2, \omega\dot\omega, \theta, \dot\theta, \theta^2, \dot\theta^2, \theta\dot\theta, \omega\theta, \omega\theta^2, \omega^2\theta, \psi, \psi^2, \psi\theta, \bar\psi, \bar\psi^2, \bar\psi\theta)$$
where $\bar\psi = \mathrm{sign}(\psi)\,(\pi - |\psi|)$.
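One common way to get per-action features of this kind in LSPI is to replicate the 20 state features into a separate block for each discrete action; a generic sketch (the state_features function is a hypothetical stand-in for the polynomial basis above):

```python
import numpy as np

def block_features(state_features, n_actions):
    """phi(s, a): the state features of s placed in action a's block,
    zeros elsewhere, giving d = n_actions * len(state_features(s))."""
    def phi(s, a):
        f = state_features(s)
        out = np.zeros(len(f) * n_actions)
        out[a * len(f):(a + 1) * len(f)] = f
        return out
    return phi
```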
25 LSPI: Riding a bike
[Figure: results, from Lagoudakis & Parr (JMLR 2003).]
26 LSPI: Riding a bike
Training samples were collected in advance by initializing the bicycle to a small random perturbation from the initial position (0, 0, 0, 0, 0, π/2) and running each episode for up to 20 steps using a purely random policy. Each successful ride had to complete a distance of 2 kilometers. The experiment was repeated 100 times. (From Lagoudakis & Parr, JMLR 2003.)
27 Feature Selection/Building Problems
Feature selection; online/incremental feature learning.
Wu and Givan (2005); Keller et al. (2006); Parr et al. (2007); Mahadevan et al. (2006); Kolter and Ng (2009); Boots and Gordon (2010); Mahadevan and Liu (2010); Sun et al. (2011); etc.