
Reinforcement Learning: Value Function Approximation
Continuous state/action spaces, mean-squared error, gradient temporal-difference learning, least-squares temporal difference, least-squares policy iteration.
Hung Ngo & Vien Ngo, MLR Lab, University of Stuttgart

Outline
- Review: Function Approximation (FA): representation issues (hypothesis and features); linear and nonlinear models; deep learning
- RL with FA: gradient-descent methods; estimating target values; linear RL (online & batch); nonlinear methods and deep RL

Example: Value Iteration in a Continuous MDP
$V(s) = \sup_a \Big[ r(s,a) + \gamma \int P(s' \mid s, a)\, V(s')\, ds' \Big]$

Model-free RL with Large/Continuous Domains
RL with table-lookup (tabular) methods is only feasible for small finite state-action spaces (exponential growth): $Q(s,a)$ and $\pi(a \mid s)$ are tables/matrices of size $|S| \times |A|$.
RL in large and/or continuous state and/or action spaces:
- Backgammon: $\sim 10^{20}$ states (board size: 28)
- Computer Go: $\sim 10^{35}$ states (9×9); $\sim 10^{171}$ states (19×19)
- Autonomous helicopter: continuous, high-dimensional spaces S and A
Scalability and generalization issues. In this course we learn/optimize parametric models: for the value functions $V(s;\theta)$, $Q(s,a;\theta)$ (this lecture); for the policy $\pi(a \mid s;\theta)$ (policy search, next lecture).

Value Function Approximation
Parameterized by θ: $V_t(s) \approx V(s;\theta_t)$, $Q_t(s,a) \approx Q(s,a;\theta_t)$.
Generalizes to unvisited states (state-action pairs).
Many choices for the model $V(s;\theta)$, $Q(s,a;\theta)$: linear regression, decision trees, nearest neighbors, ANNs (universal function approximators), GPR, kernel methods, etc.

Review: Supervised Learning
- Training data: $D = \{(x_i, y_i)\}_{i=1}^N$, $x_i \in \mathbb{R}^d$, $y_i \in Y \subseteq \mathbb{R}^m$
- Linear learner: $g(z)$, $z = Wx$, $W \in \mathbb{R}^{m \times d}$, $g: \mathbb{R}^m \to \mathbb{R}^m$
- $\hat{y} = g(Wx)$: regression; $\hat{y} = \arg\max_{i \in \{1,\ldots,m\}} g_i(Wx)$: classification
- $g$: transfer function for the desired output, e.g., (i) the identity $g(z) = z$ for regression/class margins, (ii) class posterior probabilities $P_Y(i \mid x; W)$, e.g., sigmoid for binary classification, softmax for multiclass
- Loss function $l(g, x, y) \in \mathbb{R}$: MSE, number of mistakes, negative log-likelihood, hinge, etc.
- Generalization & regularization: Occam's razor; prefer simpler $W$
- Learning: optimize $W$ w.r.t. the loss on $D$; gradient-based, PA, CW, etc.

Review: Input Representation (Feature Extraction)
Nonlinear features $\phi: \mathbb{R}^d \to \mathbb{R}^k$ plus a (linear) learner $g(W\phi(x))$: linear in the feature space Φ, but highly nonlinear in the input space X; i.e., increasing the expressiveness of the model w.r.t. X gives a better learner.
- Fixed features: designed by domain experts, e.g., tile coding, SIFT, n-grams, MFCC, basis functions $\phi(x) = (1, x_1, \ldots, x_d, x_1^2, x_1 x_2, \ldots, x_d^2)$
- Kernel trick: $k(x_1, x_2) = \phi(x_1)^\top \phi(x_2)$, e.g., $(x_1^\top x_2 + c)^n$, Gaussian, ...
- Learned features: $\phi(x) = h(W_h x)$, an artificial neuron; deep NNs. Here $h$ is a nonlinear activation function: sigmoid, tanh, rectifier/hinge $\max(0, z)$. Multiple layers of features, $\phi(x) = h_j(W_j(h_{j-1}(W_{j-1}(\ldots h_0(W_0 x)))))$, are trained together as $g(W\phi(x))$, e.g., using back-propagation. Universal function approximator!
Each feature $\phi_i$ encodes a landmark, prototype, regularity, or pattern; each feature value $\phi_i(x)$ is a similarity/distance measure between $x$ and $\phi_i$.
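To make the fixed-features idea concrete, here is a minimal sketch (not from the lecture): a hand-designed quadratic feature map feeding a batch linear least-squares learner, on made-up data that is nonlinear in x but linear in φ(x).

```python
import numpy as np

def poly_features(x):
    """Fixed nonlinear feature map phi: R^2 -> R^6 (bias, linear, quadratic terms)."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2])

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = X[:, 0] ** 2 - X[:, 1] ** 2                    # nonlinear in x, linear in phi(x)

Phi = np.stack([poly_features(x) for x in X])      # (200, 6) design matrix
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)        # batch least squares on features

x_test = np.array([0.3, -0.5])
print(w @ poly_features(x_test), x_test[0] ** 2 - x_test[1] ** 2)
```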

Example: Coarse Coding (2D)
Generalization from state X to state Y depends on how many features have receptive fields (circles) covering both; here there is only slight generalization. [Figure panels: (a) narrow generalization, (b) broad generalization, (c) asymmetric generalization.] Sparse, over-complete, distributed representation.

Example: Tile Coding (2D)
The receptive fields of the features are grouped into exhaustive partitions of the input space. Each such partition is a tiling, and each element of a partition is a tile; each tile is the receptive field for one binary feature. [Figure: two overlapping tilings of a 2D state space.] The shape of the tiles determines the generalization; the number of tilings determines the resolution of the final approximation. Exactly one feature is active in each tiling, so the total number of active features always equals the number of tilings. Also known as CMAC (cerebellar model articulation controller).
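A minimal tile-coding sketch, assuming a unit-square 2D input, equal offsets per tiling, and no hashing (real implementations such as Sutton's tiles3 differ in these details):

```python
import numpy as np

def tile_indices(x, y, n_tilings=4, tiles_per_dim=8, lo=0.0, hi=1.0):
    """Return one active tile index per tiling for a 2D input in [lo, hi]^2."""
    width = (hi - lo) / tiles_per_dim
    active = []
    for t in range(n_tilings):
        offset = t * width / n_tilings           # each tiling is shifted slightly
        ix = int((x - lo + offset) / width) % tiles_per_dim
        iy = int((y - lo + offset) / width) % tiles_per_dim
        active.append(t * tiles_per_dim ** 2 + ix * tiles_per_dim + iy)
    return active                                 # exactly one tile per tiling

def features(x, y, n_features=4 * 8 * 8):
    phi = np.zeros(n_features)
    phi[tile_indices(x, y)] = 1.0                 # sparse binary feature vector
    return phi

print(tile_indices(0.52, 0.13))
```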

Example: Radial Basis Functions
$\phi_i(x) = \mathrm{rbf}(d_i(x))$, e.g., the Gaussian kernel $\mathrm{rbf}(d) = \exp(-d^2/2)$, with
$d_i^2(x) = \|x - c_i\|^2_{\Sigma_i} = (x - c_i)^\top \Sigma_i^{-1} (x - c_i)$: distance in the metric $\Sigma_i$.
E.g., for $\Sigma_i = \sigma_i^2 I$: $\phi_i(x) = \exp\big(-\|x - c_i\|^2 / (2\sigma_i^2)\big)$, with prototype/center $c_i$ and kernel width $\sigma_i$.
[Figure: overlapping 1D Gaussian bumps centered at $c_{i-1}, c_i, c_{i+1}$ with width $\sigma_i$.]
An RBF's receptive field has continuous coverage, whereas the previous methods' receptive fields discretize the input space. If the parameters $\{c_i, \Sigma_i\}$ are learned: RBF network.
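A short sketch of Gaussian RBF features with fixed centers; the 5×5 grid of centers and the width sigma are arbitrary illustrative choices:

```python
import numpy as np

def rbf_features(x, centers, sigma=0.5):
    """Gaussian RBF features: phi_i(x) = exp(-||x - c_i||^2 / (2 sigma^2))."""
    d2 = np.sum((centers - x) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Centers on a 5x5 grid over [0, 1]^2.
grid = np.linspace(0.0, 1.0, 5)
centers = np.array([[cx, cy] for cx in grid for cy in grid])

phi = rbf_features(np.array([0.3, 0.7]), centers)
print(phi.shape, phi.max())   # (25,), largest response at the closest center
```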

Example: Polynomial Kernel
[Figure: an SVM with a polynomial kernel; data that is not linearly separable in the input space (x, y) becomes linearly separable after the feature map φ (e.g., using squared coordinates).]

Example: Tabular Basis
Naive features for a finite discrete state space $S = \{s_1, \ldots, s_n\}$:
$\phi(s) = \big(\delta_{s_1}(s), \delta_{s_2}(s), \ldots, \delta_{s_n}(s)\big)^\top$
where $\delta_{s_i}(s) \in \{0, 1\}$ is the indicator (Kronecker delta) function: 1-of-K coding, or a one-hot vector.
For (simple) generalization, add a bias weight and an augmented feature: $\phi(s) = \big(1, \delta_{s_1}(s), \delta_{s_2}(s), \ldots, \delta_{s_n}(s)\big)^\top$.

Example: the Perceptron (Frank Rosenblatt, 1957!) with fixed features φ(x). Paul Werbos: backpropagation for MLPs, 1974. Kunihiko Fukushima's deep CNN (the Neocognitron): 1980.

Example: Deep Learning, 2012
A lot of training data available plus computational power (GPUs); good learning priors (hierarchical, modular, sparse coding, pooling, regularizers such as dropout, invariances, etc.) and training algorithms.

Review: Gradient Descent (Steepest Descent)
For $f: \mathbb{R}^n \to \mathbb{R}$: $f(x + \delta x) \approx f(x) + \nabla_x f^\top \delta x + \tfrac{1}{2}\, \delta x^\top H_x\, \delta x$.
Steepest-descent direction: $\hat{\delta x} = -\nabla_x f$ (true only for small step sizes).
Effects: initial point, direction, step size; second-order methods.
[Figure: convergence paths of plain gradient, conjugate gradient, and second-order methods.]

Approximate RL: Formulation & Solution Techniques

Approx. RL: Supervised Learning Formulation
Given training data $D = \{(s_t, a_t, r_{t+1}, s_{t+1})\}_{t=0}^{H}$, with states drawn from some distribution $P(\cdot)$, minimize the mean-squared error (similarly for $Q^\pi(s,a)$):
$L(\theta) = \mathbb{E}_{s \sim P(\cdot)}\big[(V^\pi(s) - V(s;\theta))^2\big] \approx \mathbb{E}_{s \sim D}\big[(V^\pi(s) - V(s;\theta))^2\big]$
Analytical solution: $\nabla_\theta L(\theta) = 0 \;\Rightarrow\; \sum_{s \in D} \big(V^\pi(s) - V(s;\theta)\big)\,\nabla_\theta V(s;\theta) = 0$, e.g., linear least-squares methods (LSTD, LSPI).
Gradient-descent solution: batch (offline), SGD (incremental/online), or mini-batch (hybrid); SGD samples the (full) gradient at each $s_t$ / $(s_t, a_t)$.
Least mean squares (LMS), aka the Widrow-Hoff rule, is the SGD update
$\theta_{t+1} = \theta_t + \alpha_t \big(V^\pi(s_t) - V(s_t;\theta_t)\big)\,\nabla_\theta V(s_t;\theta_t)$
Combined with experience replay: store all real-world experience in $D$ and sample (i.i.d.) from it for SGD.
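A minimal sketch of the LMS/Widrow-Hoff update for a linear value model combined with a replay buffer; the targets here are random placeholders standing in for $V^\pi(s)$, since estimating them (MC, TD) is the topic of the next slide:

```python
import random
import numpy as np

def lms_update(theta, phi_s, v_target, alpha=0.1):
    """Widrow-Hoff / LMS step for a linear value model V(s; theta) = theta . phi(s)."""
    v_hat = theta @ phi_s
    return theta + alpha * (v_target - v_hat) * phi_s   # grad_theta V(s; theta) = phi(s)

# Experience replay: store (phi(s), target) pairs and sample them i.i.d. for SGD.
rng = np.random.default_rng(0)
replay = [(rng.standard_normal(8), rng.standard_normal()) for _ in range(100)]
theta = np.zeros(8)
for _ in range(1000):
    phi_s, v_target = random.choice(replay)
    theta = lms_update(theta, phi_s, v_target)
```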

Approximate RL: Estimating Target Values
No true value functions $V^\pi(s)$ / $Q^\pi(s,a)$ are given by a supervisor! Use an estimate $v_t$ in place of the target $V^\pi(s_t)$; it is unbiased if $\mathbb{E}_P[v(s_t)] = V^\pi(s_t)$.
MC: $\theta_{t+1} = \theta_t + \alpha_t \big(R_t - V(s_t;\theta_t)\big)\,\nabla_\theta V(s_t;\theta_t)$
TD(0): $\theta_{t+1} = \theta_t + \alpha_t \big(r_{t+1} + \gamma V(s_{t+1};\theta_t) - V(s_t;\theta_t)\big)\,\nabla_\theta V(s_t;\theta_t)$
TD(λ): biased for λ < 1.
- Forward view: $\theta_{t+1} = \theta_t + \alpha_t \big(R_t^\lambda - V(s_t;\theta_t)\big)\,\nabla_\theta V(s_t;\theta_t)$
- Backward view: $\theta_{t+1} = \theta_t + \alpha_t \big[r_t + \gamma V(s_{t+1};\theta_t) - V(s_t;\theta_t)\big]\, e_t$, with $e_t = \gamma\lambda e_{t-1} + \nabla_\theta V(s_t;\theta_t)$
Similarly, an estimate $q_t$ for the action-value function $Q(s_t, a_t)$ (control):
$\theta_{t+1} = \theta_t + \alpha_t \big(r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a';\theta_t) - Q(s_t, a_t;\theta_t)\big)\,\nabla_\theta Q(s_t, a_t;\theta_t)$
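A small sketch of the two most common target choices computed from one recorded episode; the linear value model $V(s;\theta) = \theta^\top\phi(s)$ anticipates the next section:

```python
import numpy as np

def mc_returns(rewards, gamma=0.99):
    """Monte-Carlo targets R_t (discounted return from t) for one finished episode."""
    R, out = 0.0, []
    for r in reversed(rewards):
        R = r + gamma * R
        out.append(R)
    return list(reversed(out))

def td0_target(r, phi_next, theta, gamma=0.99, terminal=False):
    """Bootstrapped TD(0) target r_{t+1} + gamma * V(s_{t+1}; theta)."""
    return r if terminal else r + gamma * (theta @ phi_next)

print(mc_returns([0.0, 0.0, 1.0]))   # unbiased but high-variance targets for a toy episode
```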

Linear RL

RL with Linear Function Approximation
- Input representation: fixed features φ.
- Hypothesis representation: a weight vector $\theta_t$; the prediction is a linear combination of feature values, i.e., a dot product: $V(s;\theta) = \theta^\top\phi(s)$, $Q(s,a;\theta) = \theta^\top\phi(s,a)$, $\theta \in \mathbb{R}^d$.
- Learning: update $\theta_t \to \theta_{t+1}$ given $(s_t, a_t, r_{t+1}, s_{t+1})$; the update rule is particularly simple, since $\nabla_\theta V(s;\theta) = \phi(s)$.
(Computer Go 9×9: $\sim 10^{35}$ states, $\sim 10^5$ binary features and parameters; R. Sutton, ICML 2009.)

Linear TD(λ) for Online Prediction
TD(0) with linear function approximation:
$\theta_{t+1} = \theta_t + \alpha_t \big[r_{t+1} + \gamma V(s_{t+1};\theta_t) - V(s_t;\theta_t)\big]\,\phi(s_t)$
TD(λ), with an eligibility trace:
$e_t = \gamma\lambda e_{t-1} + \phi(s_t)$
$\theta_{t+1} = \theta_t + \alpha_t \big[r_{t+1} + \gamma V(s_{t+1};\theta_t) - V(s_t;\theta_t)\big]\, e_t$
Similarly for the action-value function $Q^\pi(s_t, a_t)$.
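A sketch of one episode of linear TD(λ) prediction (backward view); the `env`, `policy`, and `phi` interfaces are assumed, gym-like placeholders:

```python
import numpy as np

def td_lambda_episode(env, policy, phi, theta, alpha=0.05, gamma=0.99, lam=0.8):
    """One episode of linear TD(lambda) prediction (backward view, accumulating trace)."""
    s, done = env.reset(), False
    e = np.zeros_like(theta)                       # eligibility trace
    while not done:
        s_next, r, done = env.step(policy(s))
        v = theta @ phi(s)
        v_next = 0.0 if done else theta @ phi(s_next)
        delta = r + gamma * v_next - v             # TD error
        e = gamma * lam * e + phi(s)               # decay, then accumulate phi(s)
        theta = theta + alpha * delta * e
        s = s_next
    return theta
```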

Linear SARSA(λ) for On-Policy Control
Initialize θ = 0 (a vector of zeros).
While not converged:
  e = 0; initialize state s ∈ S_0; choose a ~ π(s), e.g., ε-greedy w.r.t. Q(s, ·; θ).
  Repeat (for each step of the episode):
    Execute a, observe r, s'.
    e ← e + φ(s, a)
    If s' is terminal: θ ← θ + α e (r − Q(s, a; θ)); go to the next episode.
    Choose a' ~ π(s'), e.g., ε-greedy w.r.t. Q(s', ·; θ).
    θ ← θ + α e [r + γ Q(s', a'; θ) − Q(s, a; θ)]
    e ← γλ e
    s ← s', a ← a'
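A runnable sketch of the SARSA(λ) pseudocode above; the environment interface (`reset`, `step` returning `(s', r, done)`), the feature map `phi_sa`, and the discrete action list are assumptions:

```python
import numpy as np

def epsilon_greedy(theta, phi_sa, s, actions, eps=0.1):
    """Pick a random action with probability eps, otherwise the greedy one w.r.t. Q(s, .; theta)."""
    if np.random.rand() < eps:
        return actions[np.random.randint(len(actions))]
    q = [theta @ phi_sa(s, a) for a in actions]
    return actions[int(np.argmax(q))]

def sarsa_lambda(env, phi_sa, actions, d, n_episodes=200,
                 alpha=0.05, gamma=0.99, lam=0.9, eps=0.1):
    theta = np.zeros(d)                            # d = number of features
    for _ in range(n_episodes):
        e = np.zeros(d)
        s = env.reset()
        a = epsilon_greedy(theta, phi_sa, s, actions, eps)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            e = e + phi_sa(s, a)                   # accumulating trace
            q = theta @ phi_sa(s, a)
            if done:                               # terminal update, then next episode
                theta = theta + alpha * (r - q) * e
                break
            a_next = epsilon_greedy(theta, phi_sa, s_next, actions, eps)
            theta = theta + alpha * (r + gamma * (theta @ phi_sa(s_next, a_next)) - q) * e
            e = gamma * lam * e
            s, a = s_next, a_next
    return theta
```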

Linear Q(λ) for Off-Policy Control
Initialize θ = 0 (a vector of zeros).
While not converged:
  e = 0; initialize state s ∈ S_0.
  Repeat (for each step of the episode):
    Choose a ~ π(s), e.g., ε-greedy w.r.t. Q(s, ·; θ).
    If a ∉ argmax_{a'} Q(s, a'; θ): e = 0.
    Execute a, observe r, s'.
    e ← e + φ(s, a)
    If s' is terminal: θ ← θ + α e (r − Q(s, a; θ)); go to the next episode.
    θ ← θ + α e [r + γ max_{a'} Q(s', a'; θ) − Q(s, a; θ)]
    e ← γλ e
    s ← s'

SARSA(λ) with tile-coding function approximation. [Figure]

λ: How Much Bootstrapping? [Figure]

Linear Least Squares Prediction
Recall: batch (offline) training of linear models has an analytical solution.
$\nabla_\theta L(\theta) = 0$
$\Rightarrow\; 0 = \sum_{s \in D} \big(V^\pi(s) - V(s;\theta)\big)\,\nabla_\theta V(s;\theta)$
$\Rightarrow\; 0 = \sum_{s \in D} \phi(s)\big(V^\pi(s) - \phi(s)^\top\theta\big)$
$\Rightarrow\; \theta = \Big(\sum_{s \in D} \phi(s)\phi(s)^\top\Big)^{-1} \sum_{s \in D} \phi(s)\, V^\pi(s)$
Sherman-Morrison incremental update ($O(d^2)$) of $A^{-1}$:
$(A + uv^\top)^{-1} = A^{-1} - \dfrac{A^{-1} u v^\top A^{-1}}{1 + v^\top A^{-1} u} = B - \dfrac{B u v^\top B}{1 + v^\top B u}$, where $B = A^{-1}$.
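A small sketch of the Sherman-Morrison update, checked against a direct matrix inverse on random data:

```python
import numpy as np

def sherman_morrison_update(B, u, v):
    """Given B = A^{-1}, return (A + u v^T)^{-1} in O(d^2)."""
    Bu = B @ u
    vB = v @ B
    return B - np.outer(Bu, vB) / (1.0 + v @ Bu)

rng = np.random.default_rng(0)
A = np.eye(4) + 0.1 * rng.standard_normal((4, 4))
u, v = rng.standard_normal(4), rng.standard_normal(4)
print(np.allclose(sherman_morrison_update(np.linalg.inv(A), u, v),
                  np.linalg.inv(A + np.outer(u, v))))   # True: matches the direct inverse
```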

Linear Least Squares Prediction
Again, there are many choices of estimate $v(s_t)$ for the true target $V^\pi(s_t)$:
- LSMC: $v(s_t) = R_t$
- LSTD: $v(s_t) = r_{t+1} + \gamma V(s_{t+1};\theta) = r_{t+1} + \gamma\phi(s_{t+1})^\top\theta$
- LSTD(λ): $v(s_t) = R_t^\lambda$
In each case, solve directly for the fixed point of MC / TD(0) / TD(λ):
- LSMC: $\theta = \big(\sum_{s \in D} \phi(s)\phi(s)^\top\big)^{-1} \sum_{s \in D} \phi(s)\, R(s)$
- LSTD: $\theta = \big(\sum_{s \in D} \phi(s)(\phi(s) - \gamma\phi(s'))^\top\big)^{-1} \sum_{s \in D} \phi(s)\, r(s)$
- LSTD(λ): $\theta = \big(\sum_{s \in D} e(s)(\phi(s) - \gamma\phi(s'))^\top\big)^{-1} \sum_{s \in D} e(s)\, r(s)$, i.e., the fixed point where $\sum_{s \in D} \alpha\,\delta(s)\, e(s) = 0$.
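A batch LSTD sketch following the formula above; the transition format and the small ridge regularizer (to keep A invertible on finite data) are assumptions:

```python
import numpy as np

def lstd(transitions, phi, gamma=0.99, reg=1e-6):
    """Batch LSTD: solve A theta = b with A = sum phi(s)(phi(s) - gamma phi(s'))^T, b = sum phi(s) r.
    `transitions` is a list of (s, r, s_next, done) tuples; `reg` is a small ridge term."""
    d = phi(transitions[0][0]).size
    A = reg * np.eye(d)
    b = np.zeros(d)
    for s, r, s_next, done in transitions:
        f = phi(s)
        f_next = np.zeros(d) if done else phi(s_next)
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)
```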

Linear Least Squares Control
Policy evaluation: off-policy LSTDQ,
$\theta = \Big(\sum_{s \in D} \phi(s,a)\big(\phi(s,a) - \gamma\phi(s', \pi(s'))\big)^\top\Big)^{-1} \sum_{s \in D} \phi(s,a)\, r(s)$
Policy improvement: LSPI iterates over D; it repeatedly re-evaluates the experience D with different policies. Least-Squares Policy Iteration (using LSTDQ for policy evaluation):
function LSPI-TD(D, π_0):
  π' ← π_0
  repeat
    π ← π'
    Q ← LSTDQ(π, D)
    for all s ∈ S: π'(s) ← argmax_{a ∈ A} Q(s, a)
  until π ≈ π'
  return π

LSPI: Bellman Residual Minimization
The Q-function of a given policy π satisfies, for all (s, a):
$Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, Q^\pi(s', \pi(s'))$
Written as an optimization problem: minimize the Bellman residual error
$L(Q^\pi) = \| R + \gamma P \Pi\, Q^\pi - Q^\pi \|^2$
where, in Lagoudakis & Parr's matrix notation, PΠ applies the transition model and then the policy π.

LSPI: Bellman Residual Minimization
The solution of Bellman residual minimization (an overconstrained system):
$\beta^\pi = \big( (\Phi - \gamma P \Pi \Phi)^\top (\Phi - \gamma P \Pi \Phi) \big)^{-1} (\Phi - \gamma P \Pi \Phi)^\top r$
The solution $\beta^\pi$ is unique since the columns of $\Phi$ (the basis functions) are linearly independent by definition. (See Lagoudakis & Parr, JMLR 2003, for details.)

LSPI: Least-Squares Fixed-Point Approximation
Project $T^\pi Q$ back onto $\mathrm{span}(\Phi)$: $\hat{T}^\pi(Q) = \Phi(\Phi^\top\Phi)^{-1}\Phi^\top (T^\pi Q)$.
Minimize the projected Bellman residual: $\hat{T}^\pi(Q) = Q$.
The approximate fixed point: $\beta^\pi = \big(\Phi^\top(\Phi - \gamma P \Pi \Phi)\big)^{-1} \Phi^\top r$.

LSPI: Comparison of the Two Views
- Bellman residual minimization: focuses on the magnitude of the change.
- Least-squares fixed-point approximation: focuses on the direction of the change; it is less stable and less predictable.
- The least-squares fixed-point method may nevertheless be preferable: learning the Bellman-residual-minimizing approximation requires doubled samples, and experimentally the fixed-point method often delivers superior policies. (See Lagoudakis & Parr, JMLR 2003, for details.)

Convergence Properties of Prediction Algorithms

| On/Off-Policy | Algorithm | Table Lookup | Linear | Non-Linear |
|---|---|---|---|---|
| On-Policy | MC | ✓ | ✓ | ✓ |
| | LSMC | ✓ | ✓ | - |
| | TD | ✓ | ✓ | ✗ |
| | LSTD | ✓ | ✓ | - |
| Off-Policy | MC | ✓ | ✓ | ✓ |
| | LSMC | ✓ | ✓ | - |
| | TD | ✓ | ✗ | ✗ |
| | LSTD | ✓ | ✓ | - |

Convergence of Control Algorithms

| Algorithm | Table Lookup | Linear | Non-Linear |
|---|---|---|---|
| Monte-Carlo Control | ✓ | (✓) | ✗ |
| Sarsa | ✓ | (✓) | ✗ |
| Q-learning | ✓ | ✗ | ✗ |
| LSPI | ✓ | (✓) | - |

(✓) = chatters around a near-optimal value function.

Linear Q-Learning Sometimes Diverges
A counterexample from Baird (1995), due to bootstrapping plus not following true gradients. [Figure]

Using exact updates (DP backups) for policy evaluation:
$\theta_{t+1} = \theta_t + \alpha \sum_{s} P(s)\,\big[\mathbb{E}\{r_{t+1} + \gamma V_t(s_{t+1}) \mid s_t = s\} - V(s;\theta_t)\big]\,\nabla_\theta V_t(s;\theta_t)$
If $P(s)$ is uniform (≠ the true stationary distribution of the Markov chain), then the asymptotic behaviour becomes unstable.

Problems with Gradient-Descent TD Methods
- With nonlinear FA, even on-policy methods like SARSA may diverge (Tsitsiklis and Van Roy, 1997).
- With linear FA, off-policy methods like Q-learning may diverge (Baird, 1995).
- They are not true gradient-descent methods.
Gradient-TD methods (Sutton et al., 2008, 2009) solve all of the above problems: they follow the true gradient direction of the projected Bellman error.

Gradient Temporal-Difference Learning
- GTD (gradient temporal-difference learning)
- GTD2 (gradient temporal-difference learning, version 2)
- TDC (temporal-difference learning with gradient corrections)

Value Function Geometry
Bellman operator: $TV = R + \gamma P V$. [Figure: the space spanned by the feature vectors, with the value function, its Bellman backup, and its projection.]
RMSBE: residual mean-squared Bellman error; RMSPBE: residual mean-squared projected Bellman error.

TD Performance Measures
- Error from the true value: $\| V_\theta - V \|^2_P$
- Error in the Bellman update (used in the previous section: TD(0), GTD(0) methods): $\| V_\theta - T V_\theta \|^2_P$
- Error in the Bellman update after projection (TDC and GTD2 methods): $\| V_\theta - \Pi T V_\theta \|^2_P$

TD Performance Measures
GTD(0): the norm of the expected TD update,
$\mathrm{NEU}(\theta) = \mathbb{E}(\delta\phi)^\top \mathbb{E}(\delta\phi)$
GTD2 and TDC: the norm of the expected TD update weighted by the (inverse) covariance matrix of the features ($\delta$ is the TD error),
$\mathrm{MSPBE}(\theta) = \mathbb{E}(\delta\phi)^\top \mathbb{E}(\phi\phi^\top)^{-1} \mathbb{E}(\delta\phi)$
(GTD2 and TDC differ slightly in their derivation of the approximate gradient direction.)

Updates
GTD(0):
$\theta_{t+1} = \theta_t + \alpha_t (\phi(s_t) - \gamma\phi(s_{t+1}))\,\phi(s_t)^\top w_t\,\pi(s_t, a_t)$
$w_{t+1} = w_t + \beta_t (\delta_t \phi(s_t) - w_t)\,\pi(s_t, a_t)$
GTD2:
$\theta_{t+1} = \theta_t + \alpha_t (\phi(s_t) - \gamma\phi(s_{t+1}))\,\phi(s_t)^\top w_t$
$w_{t+1} = w_t + \beta_t (\delta_t - \phi(s_t)^\top w_t)\,\phi(s_t)$
TDC:
$\theta_{t+1} = \theta_t + \alpha_t \delta_t \phi(s_t) - \alpha_t \gamma \phi(s_{t+1})\,\phi(s_t)^\top w_t$
$w_{t+1} = w_t + \beta_t (\delta_t - \phi(s_t)^\top w_t)\,\phi(s_t)$
where $\delta_t = r_t + \gamma\,\theta_t^\top\phi(s_{t+1}) - \theta_t^\top\phi(s_t)$ is the TD error.
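A sketch of single-sample TDC and GTD2 updates with linear features, directly transcribing the equations above (on-policy form, without the π(s_t, a_t) weighting used in GTD(0)):

```python
import numpy as np

def tdc_update(theta, w, phi_t, phi_next, r, alpha=0.01, beta=0.05, gamma=0.99):
    """One TDC step: theta follows the corrected TD direction; w tracks the auxiliary weights."""
    delta = r + gamma * (theta @ phi_next) - (theta @ phi_t)        # TD error
    theta = theta + alpha * (delta * phi_t - gamma * phi_next * (phi_t @ w))
    w = w + beta * (delta - phi_t @ w) * phi_t
    return theta, w

def gtd2_update(theta, w, phi_t, phi_next, r, alpha=0.01, beta=0.05, gamma=0.99):
    """One GTD2 step: same w update; theta moves along (phi - gamma phi') (phi^T w)."""
    delta = r + gamma * (theta @ phi_next) - (theta @ phi_t)
    theta = theta + alpha * (phi_t - gamma * phi_next) * (phi_t @ w)
    w = w + beta * (delta - phi_t @ w) * phi_t
    return theta, w
```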

On Baird's counterexample. [Figure]

Gradient TD: Summary
Gradient-TD algorithms with linear function approximation are guaranteed convergent under both general on-policy and off-policy training. The computational complexity is only O(n), where n is the number of features, so the curse of dimensionality is avoided.

Nonlinear & Deep RL

TD-Gammon
By Gerry Tesauro (1992, 1994, 1995). World-class level. Training data: playing about 300,000 games against itself. [Figure: backgammon board, points 1-24; white pieces move counterclockwise, black pieces move clockwise.]

TD-Gammon
Input encoding (198 units): for each of the 24 points and each colour, four units encoded the number of pieces n on the point (the first three units were 1 for the first three pieces; if n > 3, the fourth unit took the value (n − 3)/2). Two additional units encoded the number of white and black pieces on the bar (each taking the value n/2, where n is the number of pieces on the bar), and further units encoded the pieces already successfully removed from the board.
TD-Gammon used the gradient-descent form of the TD(λ) algorithm, with the gradients computed by the error backpropagation algorithm.
[Figure 15.2 (Sutton & Barto): the neural network used in TD-Gammon; input: backgammon position (198 units), hidden units (40-80), output: predicted probability of winning $\hat{v}(S_t, w)$, trained from the TD error.]
Later versions combined learned value functions with decision-time search.

Fitted Q-Iteration
Given all experience $D = \{(s_t, a_t, r_{t+1}, s_{t+1})\}_{t=0}^{H}$, FQI approximates $Q(s,a;\theta)$ using offline (batch, not SGD) supervised regression by constructing a training dataset at each iteration k:
$D_k = \{((s_t, a_t),\ Q(s_t, a_t;\theta_k))\}_{t=0}^{H}$, where $Q(s_t, a_t;\theta_k) = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a';\theta_{k-1})$
Hence FQI can be regarded as approximate Q-iteration. At each iteration, any regression technique can be used: neural networks, radial basis function networks, regression trees, etc.
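A hedged FQI sketch using scikit-learn's ExtraTreesRegressor as the regressor (the lecture only requires some batch regression technique); the array shapes, the 0/1 `done` flags, and the scalar action encoding are assumptions:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor   # any batch regressor would do

def fitted_q_iteration(S, A, R, S_next, done, actions, n_iters=50, gamma=0.99):
    """FQI on a fixed batch. Shapes assumed: S, S_next (N, d_s); A (N, 1) with scalar
    actions; R, done (N,) with done in {0, 1}; `actions` lists the discrete action values."""
    X = np.hstack([S, A])                           # regression inputs (s, a)
    model = None
    for _ in range(n_iters):
        if model is None:
            y = R                                   # first iteration: target is the reward
        else:
            q_next = np.stack([model.predict(np.hstack([S_next, np.full((len(S_next), 1), a)]))
                               for a in actions])
            y = R + gamma * (1.0 - done) * q_next.max(axis=0)
        model = ExtraTreesRegressor(n_estimators=50).fit(X, y)   # regress (s, a) -> target
    return model
```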

Fitted Q-Iteration: Mountain Car Benchmark
- Control interval: Δt = 0.05 s.
- Actions are continuous, restricted to the interval [−4, 4].
- Neural network trained with RPROP, configuration 3-5-5-1 (input, two hidden layers, output).
- Training trajectories had a maximum length of 50 primitive control steps; after each trajectory/episode, one FQI iteration is run.
(Neural fitted Q-learning, Martin Riedmiller, 2005.)

Deep Fitted Q-Iteration: Neuro Slot Car Racer
[Figure: a deep autoencoder maps the high-dimensional input (vector of pixel values) through a low-dimensional bottleneck feature space, with reconstruction of the input as the training target; the policy maps feature vectors to actions, and the feature space is improved by reinforcement learning.]
After the pretraining process, the value network is trained with RProp [RB92], identically to the training of a regular feed-forward network. This setup, known as a deep network, has been shown to produce exceptionally good non-linear embeddings of system data [Lan10].

Deep Q-Networks (DQN)
Given the training data $D = \{(s_t, a_t, r_{t+1}, s_{t+1})\}_{t=0}^{H}$, recall the squared error
$L(\theta) = \frac{1}{|D|}\sum_{t=0}^{H} \big(Q^\pi(s_t,a_t) - Q(s_t,a_t;\theta)\big)^2 = \mathbb{E}_D\big[(Q^\pi(s_t,a_t) - Q(s_t,a_t;\theta))^2\big] \approx \mathbb{E}_D\big[(r_{t+1} + \gamma\max_{a} Q(s_{t+1}, a;\hat\theta) - Q(s_t,a_t;\theta))^2\big]$
$\hat\theta$ is called the "target network": a fixed model trained several steps before.
Using stochastic gradient descent at time t: sample the k-th data point $(s, a, r, s') \in D$ and apply the SGD update
$\theta_{k+1} = \theta_k + \alpha_t\big(r + \gamma\max_{a'} Q(s', a';\hat\theta_t) - Q(s, a;\theta_k)\big)\,\nabla_\theta Q(s, a;\theta_k)$

Deep Q-Networks (DQN)
$Q(s,a)$ is approximated using a deep CNN. Initialize $\theta_t, \hat\theta$ (time index t), then loop until convergence:
- Take action $a_t$ using ε-greedy w.r.t. $Q(s_t, a;\theta_t)$.
- Add $(s_t, a_t, r_t, s_{t+1})$ to the replay memory D.
- Sample a subset of transitions $(s_i, a_i, r_i, s_{i+1})$ from D.
- Optimize the least-squares error (as on the previous slide), $\sum_i \big(r_i + \gamma\max_{a'} Q(s_{i+1}, a';\hat\theta_t) - Q(s_i, a_i;\theta)\big)^2$, and return the trained model as $\theta_{t+1}$.
- Update the target network $\hat\theta = \theta_{t+1}$ after a fixed number of steps.
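A compact DQN-style sketch in PyTorch (a library choice not made in the lecture), with a small MLP standing in for the deep CNN and replay entries assumed to be stored as tensors (long actions, float rewards and done flags); the environment loop and ε-greedy action selection are only outlined:

```python
import random
from collections import deque
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Small MLP stand-in for the deep CNN on the slide (state dim and sizes are made up)."""
    def __init__(self, state_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, x):
        return self.net(x)

def dqn_step(q, q_target, optimizer, replay, batch_size=32, gamma=0.99):
    """One SGD step on the DQN loss, with a frozen target network for the bootstrap term."""
    batch = random.sample(replay, batch_size)     # entries: (s, a, r, s_next, done) tensors
    s, a, r, s_next, done = map(torch.stack, zip(*batch))
    q_sa = q(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1 - done) * q_target(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Outer loop outline (environment interaction and epsilon-greedy action selection omitted):
q, q_target = QNet(), QNet()
q_target.load_state_dict(q.state_dict())          # periodically re-sync the target network
optimizer = torch.optim.RMSprop(q.parameters(), lr=1e-3)
replay = deque(maxlen=100_000)
```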

DQN in Atari Games (from David Silver, NIPS 2014 talk). [Figure]

DQN in Atari Games
$Q(s,a)$ is approximated using a deep CNN: the input s is a stack of raw pixels from 4 consecutive frames; the control a is one of 18 joystick/button positions. (From David Silver, NIPS 2014 talk.)

DQN in Atari Games: Experience Replay vs. No Replay. [Figure; from David Silver, NIPS 2014 talk.]