Approximate Dynamic Programming


Approximate Dynamic Programming. A. LAZARIC (SequeL Team @INRIA-Lille), ENS Cachan - Master 2 MVA, MVA-RL Course.

Approximate Dynamic Programming (a.k.a. Batch Reinforcement Learning): Approximate Value Iteration; Approximate Policy Iteration.

From DP to ADP
Dynamic programming algorithms require an explicit definition of the transition probabilities $p(\cdot \mid x, a)$ and the reward function $r(x, a)$. This knowledge is often unavailable (e.g., wind intensity, human-computer interaction). Can we rely on samples?

From DP to ADP
Dynamic programming algorithms require an exact representation of value functions and policies. This is often impossible, since their shape is too complicated (e.g., large or continuous state space). Can we use approximations?

The Objective
Find a policy $\pi$ such that the performance loss $\|V^* - V^\pi\|_\infty$ is as small as possible.

From Approximation Error to Performance Loss
Question: if $V$ is an approximation of the optimal value function $V^*$ with an error $\text{error} = \|V - V^*\|_\infty$, how does it translate to the (loss of) performance of the greedy policy
$$\pi(x) \in \arg\max_{a \in A} \sum_y p(y \mid x, a)\big[r(x, a, y) + \gamma V(y)\big],$$
i.e. to the performance loss $\|V^* - V^\pi\|_\infty$?

From Approximation Error to Performance Loss
Proposition. Let $V \in \mathbb{R}^N$ be an approximation of $V^*$ and $\pi$ its corresponding greedy policy. Then
$$\underbrace{\|V^* - V^\pi\|_\infty}_{\text{performance loss}} \le \frac{2\gamma}{1-\gamma}\, \underbrace{\|V - V^*\|_\infty}_{\text{approx. error}}.$$
Furthermore, there exists $\epsilon > 0$ such that if $\|V - V^*\|_\infty \le \epsilon$, then $\pi$ is optimal.

From Approximation Error to Performance Loss
Proof.
$$\begin{aligned}
\|V^* - V^\pi\|_\infty &\le \|T V^* - T^\pi V\|_\infty + \|T^\pi V - T^\pi V^\pi\|_\infty \\
&\le \|T V^* - T V\|_\infty + \gamma\|V - V^\pi\|_\infty \\
&\le \gamma\|V^* - V\|_\infty + \gamma\big(\|V^* - V\|_\infty + \|V^* - V^\pi\|_\infty\big),
\end{aligned}$$
which, rearranged, gives $\|V^* - V^\pi\|_\infty \le \frac{2\gamma}{1-\gamma}\|V^* - V\|_\infty$. (The first step uses $T^\pi V = T V$, since $\pi$ is greedy w.r.t. $V$.)
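As a sanity check (not part of the original slides), the bound can be verified numerically on a small random MDP; the sketch below assumes numpy, computes $V^*$ by value iteration, perturbs it, extracts the greedy policy, and compares its loss with $\frac{2\gamma}{1-\gamma}\|V^* - V\|_\infty$.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 20, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))      # P[x, a, y] = p(y | x, a)
R = rng.uniform(0.0, 1.0, size=(nS, nA))           # r(x, a)

def value_iteration(tol=1e-10):
    """Compute V* by iterating the Bellman optimality operator."""
    V = np.zeros(nS)
    while True:
        V_new = (R + gamma * P @ V).max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_value(pi):
    """Exact V^pi for a deterministic policy pi (solve the linear system)."""
    P_pi, r_pi = P[np.arange(nS), pi], R[np.arange(nS), pi]
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

V_star = value_iteration()
V_approx = V_star + rng.uniform(-0.5, 0.5, size=nS)      # a perturbed approximation of V*
pi_greedy = (R + gamma * P @ V_approx).argmax(axis=1)    # greedy policy w.r.t. V_approx
loss = np.max(np.abs(V_star - policy_value(pi_greedy)))
bound = 2 * gamma / (1 - gamma) * np.max(np.abs(V_star - V_approx))
print(f"performance loss {loss:.3f} <= bound {bound:.3f}")
```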

Approximate Dynamic Programming (a.k.a. Batch Reinforcement Learning): Approximate Value Iteration; Approximate Policy Iteration.

From Approximation Error to Performance Loss
Question: how do we compute a good $V$?
Problem: unlike in standard approximation scenarios (see supervised learning), we have only limited access to the target function, i.e. $V^*$.
Solution: value iteration tends to learn functions which are close to the optimal value function $V^*$.

Value Iteration: the Idea
1. Let $Q_0$ be any action-value function
2. At each iteration $k = 1, 2, \dots, K$, compute
$$Q_{k+1}(x, a) = T Q_k(x, a) = r(x, a) + \gamma \sum_y p(y \mid x, a) \max_b Q_k(y, b)$$
3. Return the greedy policy $\pi_K(x) \in \arg\max_{a \in A} Q_K(x, a)$.
Problem: how can we approximate $T Q_k$?
Problem: if $Q_{k+1} \neq T Q_k$, does (approx.) value iteration still work?
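For reference, a minimal sketch of exact (tabular) Q-value iteration on a finite MDP, before any approximation enters; the arrays `P[x, a, y]` and `R[x, a]` holding the model are assumptions for the illustration.

```python
import numpy as np

def q_value_iteration(P, R, gamma, K):
    """Exact value iteration on Q: Q_{k+1} = T Q_k (finite states and actions)."""
    nS, nA = R.shape
    Q = np.zeros((nS, nA))
    for _ in range(K):
        # T Q(x, a) = r(x, a) + gamma * sum_y p(y|x,a) * max_b Q(y, b)
        Q = R + gamma * P @ Q.max(axis=1)
    return Q, Q.argmax(axis=1)   # Q_K and the greedy policy pi_K
```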

Linear Fitted Q-iteration: the Approximation Space
Linear space (used to approximate action-value functions)
$$\mathcal{F} = \Big\{ f_\alpha(x, a) = \sum_{j=1}^d \alpha_j \varphi_j(x, a),\ \alpha \in \mathbb{R}^d \Big\}$$
with features $\varphi_j : X \times A \to [0, L]$ and feature vector $\phi(x, a) = [\varphi_1(x, a) \dots \varphi_d(x, a)]^\top$.

Linear Fitted Q-iteration: the Samples
Assumption: access to a generative model, that is, a black-box simulator sim() of the environment. Given $(x, a)$, $\text{sim}(x, a) = \{y, r\}$, with $y \sim p(\cdot \mid x, a)$ and $r = r(x, a)$.
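A generative model can be as simple as a function returning a sampled next state and reward. The toy simulator below is one possible sim(); its constants and the wear-process dynamics are illustrative assumptions, loosely in the spirit of the replacement example at the end of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sim(x, a, beta=0.5, replace_cost=-30.0):
    """Toy generative model for a wear process: a = 'K' keeps, a = 'R' replaces."""
    if a == 'R':                          # replace: wear resets, pay a fixed cost
        y = rng.exponential(1.0 / beta)
        r = replace_cost
    else:                                 # keep: wear accumulates, pay maintenance
        y = x + rng.exponential(1.0 / beta)
        r = -4.0 * x
    return y, r

y, r = sim(2.0, 'K')                      # one call: next state and reward
```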

Linear Fitted Q-iteration
Input: space $\mathcal{F}$, number of iterations $K$, sampling distribution $\rho$, number of samples $n$
Initial function $Q_0 \in \mathcal{F}$
For $k = 1, \dots, K$
1. Draw $n$ samples $(x_i, a_i) \overset{\text{i.i.d.}}{\sim} \rho$
2. Sample $x_i' \sim p(\cdot \mid x_i, a_i)$ and $r_i = r(x_i, a_i)$
3. Compute $y_i = r_i + \gamma \max_a Q_{k-1}(x_i', a)$
4. Build the training set $\{((x_i, a_i), y_i)\}_{i=1}^n$
5. Solve the least-squares problem $f_{\hat\alpha_k} = \arg\min_{f_\alpha \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \big(f_\alpha(x_i, a_i) - y_i\big)^2$
6. Return $Q_k = f_{\hat\alpha_k}$ (truncation may be needed)
Return $\pi_K(\cdot) \in \arg\max_a Q_K(\cdot, a)$ (greedy policy)

Linear Fitted Q-iteration: Sampling
Steps 1-2: draw $n$ samples $(x_i, a_i) \overset{\text{i.i.d.}}{\sim} \rho$, then sample $x_i' \sim p(\cdot \mid x_i, a_i)$ and $r_i = r(x_i, a_i)$.
In practice this can be done once, before running the algorithm.
The sampling distribution $\rho$ should cover the state-action space in all relevant regions.
If it is not possible to choose $\rho$, a database of samples can be used.

Linear Fitted Q-iteration: The Training Set
Steps 3-4: compute $y_i = r_i + \gamma \max_a Q_{k-1}(x_i', a)$ and build the training set $\{((x_i, a_i), y_i)\}_{i=1}^n$.
Each $y_i$ is an unbiased sample of $T Q_{k-1}(x_i, a_i)$, since
$$\mathbb{E}[y_i \mid x_i, a_i] = r(x_i, a_i) + \gamma\, \mathbb{E}\big[\max_a Q_{k-1}(x_i', a)\big] = r(x_i, a_i) + \gamma \int_X \max_a Q_{k-1}(y, a)\, p(dy \mid x_i, a_i) = T Q_{k-1}(x_i, a_i).$$
The problem reduces to standard regression. The training set should be recomputed at each iteration.

Linear Fitted Q-iteration: The Regression Problem
Steps 5-6: solve the least-squares problem $f_{\hat\alpha_k} = \arg\min_{f_\alpha \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \big(f_\alpha(x_i, a_i) - y_i\big)^2$ and return $Q_k = f_{\hat\alpha_k}$ (truncation may be needed).
Thanks to the linear space we can solve it in closed form:
build the feature matrix $\Phi = [\phi(x_1, a_1)^\top; \dots; \phi(x_n, a_n)^\top]$;
compute $\hat\alpha_k = (\Phi^\top \Phi)^{-1}\Phi^\top y$ (least-squares solution);
truncate to $[-V_{\max}, V_{\max}]$, with $V_{\max} = R_{\max}/(1-\gamma)$.
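Putting the steps together, a minimal sketch of linear fitted Q-iteration (an illustration, not the original course code; `sim`, `phi`, `sample_state_action`, `actions`, and the constants are assumed inputs):

```python
import numpy as np

def linear_fqi(sim, phi, sample_state_action, actions, d,
               n=1000, K=50, gamma=0.95, v_max=100.0):
    """Linear Fitted Q-iteration (sketch): returns the final weight vector alpha_K."""
    alpha = np.zeros(d)                                        # Q_0 = 0
    q = lambda x, a, w: np.clip(phi(x, a) @ w, -v_max, v_max)  # truncated Q_{k-1}
    for k in range(K):
        # steps 1-2: draw (x_i, a_i) ~ rho and query the generative model
        pairs = [sample_state_action() for _ in range(n)]
        samples = [sim(x, a) for x, a in pairs]
        # step 3: regression targets y_i = r_i + gamma * max_a Q_{k-1}(x'_i, a)
        targets = np.array([r + gamma * max(q(y, b, alpha) for b in actions)
                            for (y, r) in samples])
        # steps 4-5: least-squares fit on the feature matrix
        Phi = np.array([phi(x, a) for x, a in pairs])
        alpha, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
    return alpha     # greedy policy: pi_K(x) = argmax_a phi(x, a) @ alpha
```

Here `np.linalg.lstsq` is used instead of the explicit normal equations $(\Phi^\top \Phi)^{-1}\Phi^\top y$ for numerical stability; the two coincide when $\Phi^\top \Phi$ is invertible.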

Sketch of the Analysis
Diagram: starting from $Q_0$, each iteration computes the target $T Q_{k-1}$ and approximates it by $Q_k$ with error $\epsilon_k$; after $K$ iterations the final error between $Q^*$ and $Q^{\pi_K}$ (the value of the greedy policy $\pi_K$) results from the accumulation of the per-iteration errors.

Theoretical Objectives
Objective: derive a bound on the performance (quadratic) loss w.r.t. a testing distribution $\mu$: $\|Q^* - Q^{\pi_K}\|_\mu \le\ ???$
Sub-Objective 1: derive an intermediate bound on the prediction error at any iteration $k$ w.r.t. the sampling distribution $\rho$: $\|T Q_{k-1} - Q_k\|_\rho \le\ ???$
Sub-Objective 2: analyze how the error at each iteration propagates through iterations: $\|Q^* - Q^{\pi_K}\|_\mu \le \text{propagation}\big(\|T Q_{k-1} - Q_k\|_\rho\big)$

The Sources of Error
Desired solution (the target of iteration $k$): $\tilde{Q}_k = T Q_{k-1}$.
Best solution in $\mathcal{F}$ (w.r.t. the sampling distribution $\rho$): $f_{\alpha_k^*} = \arg\inf_{f_\alpha \in \mathcal{F}} \|f_\alpha - \tilde{Q}_k\|_\rho$; this is the error due to the approximation space $\mathcal{F}$.
Returned solution: $f_{\hat\alpha_k} = \arg\min_{f_\alpha \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \big(f_\alpha(x_i, a_i) - y_i\big)^2$; this adds the error due to the (random) samples.

Per-Iteration Error
Theorem. At each iteration $k$, Linear-FQI returns an approximation $Q_k$ such that (Sub-Objective 1)
$$\|\tilde{Q}_k - Q_k\|_\rho \le 4\|\tilde{Q}_k - f_{\alpha_k^*}\|_\rho + O\Big((V_{\max} + L\|\alpha_k^*\|)\sqrt{\frac{\log(1/\delta)}{n}}\Big) + O\Big(V_{\max}\sqrt{\frac{d\log(n/\delta)}{n}}\Big),$$
with probability $1-\delta$.
Tools: concentration-of-measure inequalities, covering spaces, linear algebra, union bounds, special tricks for linear spaces, ...

Per-Iteration Error: Remarks
$$\|\tilde{Q}_k - Q_k\|_\rho \le 4\|\tilde{Q}_k - f_{\alpha_k^*}\|_\rho + O\Big((V_{\max} + L\|\alpha_k^*\|)\sqrt{\frac{\log(1/\delta)}{n}}\Big) + O\Big(V_{\max}\sqrt{\frac{d\log(n/\delta)}{n}}\Big)$$
First term (approximation error): no algorithm can do better, up to the constant 4; it depends on the space $\mathcal{F}$ and changes with the iteration $k$.
Second term: vanishes as $O(n^{-1/2})$; it depends on the features (through $L$) and on the best solution (through $\|\alpha_k^*\|$).
Third term: vanishes as $O(n^{-1/2})$; it depends on the dimensionality of the space ($d$) and on the number of samples ($n$).

Error Propagation
Objective: bound $\|Q^* - Q^{\pi_K}\|_\mu$.
Problem 1: the test norm $\mu$ is different from the sampling norm $\rho$.
Problem 2: we have bounds on $Q_k$, not on the performance of the corresponding $\pi_k$.
Problem 3: we have bounds for one single iteration only.

Error Propagation
Transition kernel for a fixed policy: $P^\pi$. The $m$-step (worst-case) concentration of the future state distribution is
$$c(m) = \sup_{\pi_1, \dots, \pi_m} \Big\| \frac{d(\mu P^{\pi_1} \cdots P^{\pi_m})}{d\rho} \Big\|_\infty < \infty,$$
and the average (discounted) concentration is
$$C_{\mu,\rho} = (1-\gamma)^2 \sum_{m \ge 1} m\, \gamma^{m-1} c(m) < +\infty.$$

Error Propagation
Remark: relationship to the top-Lyapunov exponent
$$L^+ = \sup_\pi \limsup_{m \to \infty} \frac{1}{m} \log^+ \big(\|\rho\, P^{\pi_1} P^{\pi_2} \cdots P^{\pi_m}\|\big).$$
If $L^+ \le 0$ (stable system), then $c(m)$ has a polynomial growth rate and $C_{\mu,\rho} < \infty$.

Error Propagation
Proposition. Let $\epsilon_k = \tilde{Q}_k - Q_k$ (with $\tilde{Q}_k = T Q_{k-1}$) be the error at each iteration. Then after $K$ iterations the performance loss of the greedy policy $\pi_K$ satisfies
$$\|Q^* - Q^{\pi_K}\|_\mu^2 \le \Big[\frac{2\gamma}{(1-\gamma)^2}\Big]^2 C_{\mu,\rho}\, \max_k \|\epsilon_k\|_\rho^2 + O\Big(\frac{\gamma^K}{(1-\gamma)^3}\, V_{\max}^2\Big).$$

The Final Bound
Bringing everything together:
$$\|Q^* - Q^{\pi_K}\|_\mu^2 \le \Big[\frac{2\gamma}{(1-\gamma)^2}\Big]^2 C_{\mu,\rho}\, \max_k \|\epsilon_k\|_\rho^2 + O\Big(\frac{\gamma^K}{(1-\gamma)^3}\, V_{\max}^2\Big),$$
with
$$\|\epsilon_k\|_\rho = \|\tilde{Q}_k - Q_k\|_\rho \le 4\|\tilde{Q}_k - f_{\alpha_k^*}\|_\rho + O\Big((V_{\max} + L\|\alpha_k^*\|)\sqrt{\tfrac{\log(1/\delta)}{n}}\Big) + O\Big(V_{\max}\sqrt{\tfrac{d\log(n/\delta)}{n}}\Big).$$

The Final Bound
Theorem (see e.g., Munos, 2003). LinearFQI with a space $\mathcal{F}$ of $d$ features and $n$ samples at each iteration returns a policy $\pi_K$ after $K$ iterations such that
$$\|Q^* - Q^{\pi_K}\|_\mu \le \frac{2\gamma}{(1-\gamma)^2}\sqrt{C_{\mu,\rho}}\Big(4\, d(\mathcal{F}, T\mathcal{F}) + O\Big(V_{\max}\Big(1 + \frac{L}{\sqrt{\omega}}\Big)\sqrt{\frac{d\log(n/\delta)}{n}}\Big)\Big) + O\Big(\frac{\gamma^K}{(1-\gamma)^3}\, V_{\max}^2\Big).$$

The Final Bound: Remarks
The propagation (and the change of norms) makes the problem more complex: how do we choose the sampling distribution $\rho$?
The approximation error $d(\mathcal{F}, T\mathcal{F})$ is worse than in regression.

The Final Bound: the inherent Bellman error
$$\|\tilde{Q}_k - f_{\alpha_k^*}\|_\rho = \inf_{f \in \mathcal{F}} \|\tilde{Q}_k - f\|_\rho = \inf_{f \in \mathcal{F}} \|T Q_{k-1} - f\|_\rho = \inf_{f \in \mathcal{F}} \|T f_{\hat\alpha_{k-1}} - f\|_\rho \le \sup_{g \in \mathcal{F}} \inf_{f \in \mathcal{F}} \|T g - f\|_\rho = d(\mathcal{F}, T\mathcal{F}).$$
Question: how do we design $\mathcal{F}$ to make it compatible with the Bellman operator?

The Final Bound: Remarks (continued)
The dependency on $\gamma$ is worse than in the per-iteration bound: is it possible to avoid it?
The last term decreases exponentially in $K$: taking $K$ of the order of $\frac{\log(1/\epsilon)}{1-\gamma}$ makes it smaller than $\epsilon$.
$\omega$ is the smallest eigenvalue of the Gram matrix: design the features so as to be orthogonal w.r.t. $\rho$.
The asymptotic rate $O(d/n)$ is the same as for regression.

Summary
The final performance $\|Q^* - Q^{\pi_K}\|_\mu$ depends on:
the approximation algorithm, through the per-iteration error $\|\tilde{Q}_k - Q_k\|_\rho$;
the dynamic programming algorithm, through the error propagation;
the samples: sampling strategy, number $n$, sampling distribution $\rho$;
the Markov decision process: concentrability $C_{\mu,\rho}$, range $V_{\max}$;
the approximation space: inherent Bellman error $d(\mathcal{F}, T\mathcal{F})$, size $d$, features ($\omega$).

Other implementations
Replace the regression step with: K-nearest neighbors; regularized linear regression with $L_1$ or $L_2$ penalization; neural networks; support vector regression; ...
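For instance, the regression step can be swapped for any off-the-shelf regressor; a hedged sketch of one fitted Q-iteration fit with scikit-learn's k-nearest-neighbors regressor (assuming scikit-learn is available and that `X_sa`, `y` are the state-action features and regression targets built as above):

```python
from sklearn.neighbors import KNeighborsRegressor

def fqi_fit_knn(X_sa, y, n_neighbors=10):
    """Fit Q_k with a k-NN regressor instead of linear least squares."""
    model = KNeighborsRegressor(n_neighbors=n_neighbors)
    model.fit(X_sa, y)        # X_sa: (n, feature_dim) state-action inputs, y: targets
    return model.predict      # Q_k(x, a) is then approximated by model.predict(...)
```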

Example: the Optimal Replacement Problem
State: level of wear of an object (e.g., a car).
Action: {(R)eplace, (K)eep}.
Cost: $c(x, R) = C$; $c(x, K) = c(x)$, the maintenance plus extra costs.
Dynamics: $p(\cdot \mid x, R) \sim \exp(\beta)$ with density $d(y) = \beta e^{-\beta y}\,\mathbb{I}\{y \ge 0\}$; $p(\cdot \mid x, K) \sim x + \exp(\beta)$ with density $d(y - x)$.
Problem: minimize the discounted expected cost over an infinite horizon.

Example: the Optimal Replacement Problem
Optimal value function:
$$V^*(x) = \min\Big\{ c(x) + \gamma \int_0^\infty d(y - x)\, V^*(y)\, dy,\ \ C + \gamma \int_0^\infty d(y)\, V^*(y)\, dy \Big\}$$
Optimal policy: the action that attains the minimum.
Figure: the management cost and the optimal value function as functions of the wear $x \in [0, 10]$; the optimal policy alternates between (K)eep and (R)eplace regions.
Linear approximation space: $\mathcal{F} := \big\{ V_n(x) = \sum_{k=1}^{20} \alpha_k \cos\big(k\pi \frac{x}{x_{\max}}\big) \big\}$.
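A possible sketch of fitted value iteration with the 20 cosine features on this example (illustrative only: the constants $\beta$, $C$, the maintenance cost $c(x)$, the clipping at $x_{\max}$, and the Monte Carlo approximation of the integrals are assumptions, not taken from the slides):

```python
import numpy as np

gamma, beta, C_repl, x_max, d_feat = 0.6, 0.6, 30.0, 10.0, 20
c = lambda x: 4.0 * x                                    # assumed maintenance cost
phi = lambda x: np.cos(np.arange(1, d_feat + 1) * np.pi * x / x_max)

def bellman_target(x, V, n_mc=100, rng=np.random.default_rng(0)):
    """Approximate (T V)(x) by Monte Carlo over the exponential wear increments."""
    z = rng.exponential(1.0 / beta, size=n_mc)
    keep = c(x) + gamma * np.mean(V(np.minimum(x + z, x_max)))
    replace = C_repl + gamma * np.mean(V(np.minimum(z, x_max)))
    return min(keep, replace)

def fitted_vi_step(alpha, grid):
    """One fitted value iteration sweep: regress T V_k onto the cosine features."""
    V = lambda xs: np.array([phi(x) @ alpha for x in np.atleast_1d(xs)])
    targets = np.array([bellman_target(x, V) for x in grid])
    Phi = np.array([phi(x) for x in grid])
    return np.linalg.lstsq(Phi, targets, rcond=None)[0]

grid = np.linspace(0.0, x_max, 50)                       # uniform grid of sampled states
alpha = np.zeros(d_feat)
for _ in range(20):                                      # a few fitted VI iterations
    alpha = fitted_vi_step(alpha, grid)
```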

Example: the Optimal Replacement Problem
Collect $N$ samples on a uniform grid.
Figure: the sampled states (uniform grid over $[0, 10]$) with their target values.
Figure: Left: the target values computed as $\{T V_0(x_n)\}_{1 \le n \le N}$. Right: the approximation $V_1 \in \mathcal{F}$ of the target function $T V_0$.
Figure: Left: the target values computed as $\{T V_1(x_n)\}_{1 \le n \le N}$. Center: the approximation $V_2 \in \mathcal{F}$ of $T V_1$. Right: the approximation $V_n \in \mathcal{F}$ after $n$ iterations.

Example: the Optimal Replacement Problem: simulation.

Approximate Dynamic Programming (a.k.a. Batch Reinforcement Learning): Approximate Value Iteration; Approximate Policy Iteration.

Policy Iteration: the Idea
1. Let $\pi_0$ be any stationary policy
2. At each iteration $k = 1, 2, \dots, K$:
Policy evaluation: given $\pi_k$, compute $V_k = V^{\pi_k}$.
Policy improvement: compute the greedy policy $\pi_{k+1}(x) \in \arg\max_{a \in A} \big[ r(x, a) + \gamma \sum_y p(y \mid x, a)\, V^{\pi_k}(y) \big]$.
3. Return the last policy $\pi_K$
Problem: how can we approximate $V^{\pi_k}$?
Problem: if $V_k \neq V^{\pi_k}$, does (approx.) policy iteration still work?

Approximate Policy Iteration: performance loss
Problem: the algorithm is no longer guaranteed to converge.
Figure: $\|V^* - V^{\pi_k}\|$ as a function of $k$, oscillating within an asymptotic error band.
Proposition. The asymptotic performance of the policies $\pi_k$ generated by the API algorithm is related to the approximation error as
$$\limsup_{k \to \infty} \underbrace{\|V^* - V^{\pi_k}\|_\infty}_{\text{performance loss}} \le \frac{2\gamma}{(1-\gamma)^2} \limsup_{k \to \infty} \underbrace{\|V_k - V^{\pi_k}\|_\infty}_{\text{approximation error}}$$

Least-Squares Policy Iteration (LSPI)
LSPI uses a linear space to approximate value functions*
$$\mathcal{F} = \Big\{ f(x) = \sum_{j=1}^d \alpha_j \varphi_j(x),\ \alpha \in \mathbb{R}^d \Big\}$$
and the Least-Squares Temporal Difference (LSTD) algorithm for policy evaluation.
*In practice we use approximations of action-value functions.

Least-Squares Temporal-Difference Learning (LSTD)
$V^\pi$ may not belong to $\mathcal{F}$: $V^\pi \notin \mathcal{F}$. The best approximation of $V^\pi$ in $\mathcal{F}$ is $\Pi V^\pi = \arg\min_{f \in \mathcal{F}} \|V^\pi - f\|$, where $\Pi$ is the projection onto $\mathcal{F}$.
Figure: $V^\pi$ and its projection $\Pi V^\pi$ onto the space $\mathcal{F}$.

Least-Squares Temporal-Difference Learning (LSTD)
$V^\pi$ is the fixed point of $T^\pi$: $V^\pi = T^\pi V^\pi = r^\pi + \gamma P^\pi V^\pi$.
LSTD searches for the fixed point of $\Pi_{2,\rho}\, T^\pi$, where $\Pi_{2,\rho}\, g = \arg\min_{f \in \mathcal{F}} \|g - f\|_{2,\rho}$.
When the fixed point of $\Pi_\rho T^\pi$ exists, we call it the LSTD solution: $V_{TD} = \Pi_\rho T^\pi V_{TD}$.
Figure: $V^\pi$, $T^\pi V_{TD}$, and the fixed point $V_{TD} = \Pi_\rho T^\pi V_{TD}$ in the space $\mathcal{F}$.

Least-Squares Temporal-Difference Learning (LSTD)
$V_{TD} = \Pi_\rho T^\pi V_{TD}$. The projection $\Pi_\rho$ is orthogonal in expectation w.r.t. the space $\mathcal{F}$ spanned by the features $\varphi_1, \dots, \varphi_d$:
$$\mathbb{E}_{x \sim \rho}\big[(T^\pi V_{TD}(x) - V_{TD}(x))\varphi_i(x)\big] = 0 \quad \forall i \in [1, d]
\ \Longleftrightarrow\ \langle T^\pi V_{TD} - V_{TD},\ \varphi_i \rangle_\rho = 0.$$
By definition of the Bellman operator:
$$\langle r^\pi + \gamma P^\pi V_{TD} - V_{TD},\ \varphi_i \rangle_\rho = 0 \ \Longleftrightarrow\ \langle r^\pi, \varphi_i \rangle_\rho - \langle (I - \gamma P^\pi) V_{TD},\ \varphi_i \rangle_\rho = 0.$$
Since $V_{TD} \in \mathcal{F}$, there exists $\alpha_{TD}$ such that $V_{TD}(x) = \phi(x)^\top \alpha_{TD}$, hence
$$\langle r^\pi, \varphi_i \rangle_\rho - \sum_{j=1}^d \langle (I - \gamma P^\pi)\varphi_j,\ \varphi_i \rangle_\rho\, \alpha_{TD,j} = 0.$$

Least-Squares Temporal-Difference Learning (LSTD)
$V_{TD} = \Pi_\rho T^\pi V_{TD}$:
$$\underbrace{\langle r^\pi, \varphi_i \rangle_\rho}_{b_i} - \sum_{j=1}^d \underbrace{\langle (I - \gamma P^\pi)\varphi_j,\ \varphi_i \rangle_\rho}_{A_{i,j}}\, \alpha_{TD,j} = 0 \ \Longleftrightarrow\ A\, \alpha_{TD} = b.$$
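When the model ($P^\pi$, $r^\pi$) and the weighting distribution $\rho$ are known for a finite state space, the system above can be solved directly; a small sketch (assumes numpy; `Phi` is the $|X| \times d$ feature matrix, `P_pi` the transition matrix of $\pi$, `r_pi` the reward vector):

```python
import numpy as np

def lstd_model_based(Phi, P_pi, r_pi, rho, gamma):
    """Solve A alpha = b with A_ij = <(I - gamma P^pi) phi_j, phi_i>_rho, b_i = <r^pi, phi_i>_rho."""
    D = np.diag(rho)                                  # weights the inner product by rho
    A = Phi.T @ D @ (Phi - gamma * P_pi @ Phi)
    b = Phi.T @ D @ r_pi
    alpha_td = np.linalg.solve(A, b)
    return Phi @ alpha_td                             # V_TD(x) = phi(x)^T alpha_TD
```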

Least-Squares Temporal-Difference Learning (LSTD)
Problem: in general, $\Pi_\rho T^\pi$ is not a contraction and does not have a fixed point.
Solution: if $\rho = \rho^\pi$ (the stationary distribution of $\pi$), then $\Pi_{\rho^\pi} T^\pi$ has a unique fixed point.
Problem: in general, $\Pi_\rho T^\pi$ cannot be computed (because the model is unknown).
Solution: use samples coming from a trajectory of $\pi$.

Least-Squares Policy Iteration (LSPI)
Input: space $\mathcal{F}$, number of iterations $K$, sampling distribution $\rho$, number of samples $n$
Initial policy $\pi_0$
For $k = 1, \dots, K$
1. Generate a trajectory of length $n$ from the stationary distribution $\rho^{\pi_k}$: $(x_1, \pi_k(x_1), r_1, x_2, \pi_k(x_2), r_2, \dots, x_{n-1}, \pi_k(x_{n-1}), r_{n-1}, x_n)$
2. Compute the empirical matrix $\hat{A}_k$ and the vector $\hat{b}_k$:
$$[\hat{A}_k]_{i,j} = \frac{1}{n}\sum_{t=1}^{n}\big(\varphi_j(x_t) - \gamma\varphi_j(x_{t+1})\big)\varphi_i(x_t) \approx \langle (I - \gamma P^{\pi_k})\varphi_j,\ \varphi_i\rangle_{\rho^{\pi_k}}, \qquad [\hat{b}_k]_i = \frac{1}{n}\sum_{t=1}^{n}\varphi_i(x_t)\, r_t \approx \langle r^{\pi_k}, \varphi_i\rangle_{\rho^{\pi_k}}$$
3. Solve the linear system $\alpha_k = \hat{A}_k^{-1}\hat{b}_k$
4. Compute the greedy policy $\pi_{k+1}$ w.r.t. $V_k = f_{\alpha_k}$
Return the last policy $\pi_K$
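A sketch of the empirical LSTD step on a sampled trajectory (illustrative; assumes numpy and a feature map `phi(x)`; the small ridge term is a common safeguard against an ill-conditioned $\hat{A}_k$ and is an implementation choice, not part of the slides):

```python
import numpy as np

def lstd_from_trajectory(states, rewards, phi, gamma, d, ridge=1e-6):
    """Estimate alpha from a single trajectory: solve A_hat alpha = b_hat."""
    A_hat = ridge * np.eye(d)
    b_hat = np.zeros(d)
    n = len(rewards)                           # transitions (x_t, r_t, x_{t+1}); len(states) = n + 1
    for t in range(n):
        f, f_next = phi(states[t]), phi(states[t + 1])
        A_hat += np.outer(f, f - gamma * f_next) / n
        b_hat += f * rewards[t] / n
    return np.linalg.solve(A_hat, b_hat)
```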

Least-Squares Policy Iteration (LSPI)
Step 1 generates a trajectory of length $n$ from the stationary distribution $\rho^{\pi_k}$: $(x_1, \pi_k(x_1), r_1, \dots, x_{n-1}, \pi_k(x_{n-1}), r_{n-1}, x_n)$.
The first few samples may be discarded, because they are not actually drawn from the stationary distribution $\rho^{\pi_k}$.
Off-policy samples could be used with importance weighting.
In practice, i.i.d. states drawn from an arbitrary distribution (but with actions chosen by $\pi_k$) may be used.

Least-Squares Policy Iteration (LSPI)
Step 4 computes the greedy policy $\pi_{k+1}$ w.r.t. $V_k = f_{\alpha_k}$. Computing the greedy policy from $V_k$ is difficult (it requires the transition model), so one moves to LSTD-Q (action-value functions) and computes $\pi_{k+1}(x) = \arg\max_a Q_k(x, a)$.

Least-Squares Policy Iteration (LSPI)
For $k = 1, \dots, K$: generate a trajectory of length $n$ from the stationary distribution $\rho^{\pi_k}$, ..., compute the greedy policy $\pi_{k+1}$ w.r.t. $V_k = f_{\alpha_k}$.
Problem: this process may be unstable, because $\pi_k$ does not cover the state space properly.

LSTD Algorithm
When $n \to \infty$, then $\hat{A} \to A$ and $\hat{b} \to b$, and thus $\hat\alpha_{TD} \to \alpha_{TD}$ and $\hat{V}_{TD} \to V_{TD}$.
Proposition (LSTD Performance). If LSTD is used to estimate the value of $\pi$ with an infinite number of samples drawn from the stationary distribution $\rho^\pi$, then
$$\|V^\pi - V_{TD}\|_{\rho^\pi} \le \frac{1}{\sqrt{1-\gamma^2}} \inf_{V \in \mathcal{F}} \|V^\pi - V\|_{\rho^\pi}.$$
Problem: we don't have an infinite number of samples...
Problem 2: $V_{TD}$ is a fixed-point solution, not the solution of a standard machine learning (regression) problem...

LSTD Error Bound
Assumption: the Markov chain induced by the policy $\pi_k$ has a stationary distribution $\rho^{\pi_k}$ and it is ergodic and $\beta$-mixing.
Theorem (LSTD Error Bound). At any iteration $k$, if LSTD uses $n$ samples obtained from a single trajectory of $\pi_k$ and a $d$-dimensional space, then with probability $1-\delta$
$$\|V^{\pi_k} - \hat{V}_k\|_{\rho^{\pi_k}} \le \frac{c}{\sqrt{1-\gamma^2}} \inf_{f \in \mathcal{F}} \|V^{\pi_k} - f\|_{\rho^{\pi_k}} + O\Big(\sqrt{\frac{d\log(d/\delta)}{n\,\nu}}\Big).$$

LSTD Error Bound
$$\|V^\pi - \hat{V}\|_{\rho^\pi} \le \underbrace{\frac{c}{\sqrt{1-\gamma^2}} \inf_{f \in \mathcal{F}} \|V^\pi - f\|_{\rho^\pi}}_{\text{approximation error}} + \underbrace{O\Big(\sqrt{\frac{d\log(d/\delta)}{n\,\nu}}\Big)}_{\text{estimation error}}$$
Approximation error: it depends on how well the function space $\mathcal{F}$ can approximate the value function $V^\pi$.
Estimation error: it depends on the number of samples $n$, the dimension of the function space $d$, the smallest eigenvalue of the Gram matrix $\nu$, and the mixing properties of the Markov chain (hidden in the $O(\cdot)$ notation).

LSTD Error Bound
$$\|V^{\pi_k} - \hat{V}_k\|_{\rho^{\pi_k}} \le \underbrace{\frac{c}{\sqrt{1-\gamma^2}} \inf_{f \in \mathcal{F}} \|V^{\pi_k} - f\|_{\rho^{\pi_k}}}_{\text{approximation error}} + \underbrace{O\Big(\sqrt{\frac{d\log(d/\delta)}{n\,\nu_k}}\Big)}_{\text{estimation error}}$$
$n$ is the number of samples and $d$ the dimensionality of the space.
$\nu_k$ is the smallest eigenvalue of the Gram matrix $\big(\int \varphi_i \varphi_j\, d\rho^{\pi_k}\big)_{i,j}$ (assumption: the eigenvalues of the Gram matrix are strictly positive, i.e. the model-based LSTD solution exists).
The $\beta$-mixing coefficients are hidden in the $O(\cdot)$ notation.

LSPI Error Bound
Theorem (LSPI Error Bound). If LSPI is run over $K$ iterations, then the performance loss of the policy $\pi_K$ is, with probability $1-\delta$,
$$\|V^* - V^{\pi_K}\|_\mu \le \frac{4\gamma}{(1-\gamma)^2}\Big\{ \sqrt{C\, C_{\mu,\rho}}\,\Big[ c\, E_0(\mathcal{F}) + O\Big(\sqrt{\frac{d\log(dK/\delta)}{n\,\nu_\rho}}\Big)\Big] + \gamma^K R_{\max} \Big\}.$$
Approximation error: $E_0(\mathcal{F}) = \sup_{\pi \in \mathcal{G}(\tilde{\mathcal{F}})} \inf_{f \in \mathcal{F}} \|V^\pi - f\|_{\rho^\pi}$, where $\mathcal{G}(\tilde{\mathcal{F}})$ is the set of greedy policies induced by functions in $\mathcal{F}$.
Estimation error: depends on $n$, $d$, $\nu_\rho$, $K$.
Initialization error: the error due to the choice of the initial value function or initial policy, $\|V^* - V^{\pi_0}\|$.

LSPI Error Bound
Lower-bounding distribution: there exists a distribution $\rho$ such that for any policy $\pi \in \mathcal{G}(\tilde{\mathcal{F}})$ we have $\rho \le C \rho^\pi$, where $C < \infty$ is a constant and $\rho^\pi$ is the stationary distribution of $\pi$. Furthermore, we can define the concentrability coefficient $C_{\mu,\rho}$ as before.
$\nu_\rho$ is the smallest eigenvalue of the Gram matrix $\big(\int \varphi_i \varphi_j\, d\rho\big)_{i,j}$.

Bellman Residual Minimization (BRM): the idea
Let $\mu$ be a distribution over $X$; $V_{BR}$ is the minimizer of the Bellman residual w.r.t. $T^\pi$:
$$V_{BR} = \arg\min_{V \in \mathcal{F}} \|T^\pi V - V\|_{2,\mu}.$$
Figure: $V^\pi$, the space $\mathcal{F}$, and $V_{BR}$ minimizing the Bellman residual $\|T^\pi V - V\|$.

Bellman Residual Minimization (BRM): the idea
The mapping $\alpha \mapsto T^\pi V_\alpha - V_\alpha$ is affine, so the function $\alpha \mapsto \|T^\pi V_\alpha - V_\alpha\|_\mu^2$ is quadratic. The minimum is obtained by computing the gradient and setting it to zero:
$$\Big\langle r^\pi + (\gamma P^\pi - I)\sum_{j=1}^d \varphi_j \alpha_j,\ (\gamma P^\pi - I)\varphi_i \Big\rangle_\mu = 0,$$
which can be rewritten as $A\alpha = b$, with
$$A_{i,j} = \langle \varphi_i - \gamma P^\pi \varphi_i,\ \varphi_j - \gamma P^\pi \varphi_j \rangle_\mu, \qquad b_i = \langle \varphi_i - \gamma P^\pi \varphi_i,\ r^\pi \rangle_\mu.$$

Bellman Residual Minimization (BRM): the idea
Remark: the system admits a solution whenever the features $\varphi_i$ are linearly independent w.r.t. $\mu$.
Remark: let $\psi_i = \varphi_i - \gamma P^\pi \varphi_i$, $i = 1, \dots, d$; then the previous system can be interpreted as a linear regression problem: fit $r^\pi$ with the features $\psi_i$, i.e. minimize $\|\psi^\top \alpha - r^\pi\|_\mu$.

BRM: the approximation error
Proposition. We have
$$\|V^\pi - V_{BR}\| \le \|(I - \gamma P^\pi)^{-1}\|\,(1 + \gamma\|P^\pi\|) \inf_{V \in \mathcal{F}} \|V^\pi - V\|.$$
If $\mu^\pi$ is the stationary distribution of $\pi$, then $\|P^\pi\|_{\mu^\pi} = 1$ and $\|(I - \gamma P^\pi)^{-1}\|_{\mu^\pi} = \frac{1}{1-\gamma}$, thus
$$\|V^\pi - V_{BR}\|_{\mu^\pi} \le \frac{1+\gamma}{1-\gamma} \inf_{V \in \mathcal{F}} \|V^\pi - V\|_{\mu^\pi}.$$

BRM: the implementation
Assumption: a generative model is available.
Draw $n$ states $X_t \sim \mu$.
Call the generative model on $(X_t, A_t)$ (with $A_t = \pi(X_t)$) and obtain $R_t = r(X_t, A_t)$, $Y_t \sim p(\cdot \mid X_t, A_t)$.
Compute the empirical Bellman residual
$$\hat{B}(V) = \frac{1}{n}\sum_{t=1}^n \Big[ V(X_t) - \underbrace{\big(R_t + \gamma V(Y_t)\big)}_{\hat{T} V(X_t)} \Big]^2.$$

BRM: the implementation
Problem: this estimator is biased and not consistent! In fact,
$$\mathbb{E}[\hat{B}(V)] = \mathbb{E}\Big[\big[V(X_t) - T^\pi V(X_t) + T^\pi V(X_t) - \hat{T} V(X_t)\big]^2\Big] = \|T^\pi V - V\|_\mu^2 + \mathbb{E}\Big[\big[T^\pi V(X_t) - \hat{T} V(X_t)\big]^2\Big],$$
so minimizing $\hat{B}(V)$ does not correspond to minimizing $B(V)$ (even when $n \to \infty$).

BRM: the implementation
Solution: in each state $X_t$, generate two independent samples $Y_t$ and $Y_t' \sim p(\cdot \mid X_t, A_t)$ and define
$$\hat{B}(V) = \frac{1}{n}\sum_{t=1}^n \big[V(X_t) - \big(R_t + \gamma V(Y_t)\big)\big]\big[V(X_t) - \big(R_t + \gamma V(Y_t')\big)\big].$$
Then $\hat{B} \to B$ as $n \to \infty$.

BRM: the implementation
The function $\alpha \mapsto \hat{B}(V_\alpha)$ is quadratic and we obtain the linear system
$$\hat{A}_{i,j} = \frac{1}{n}\sum_{t=1}^n \big[\varphi_i(X_t) - \gamma \varphi_i(Y_t)\big]\big[\varphi_j(X_t) - \gamma \varphi_j(Y_t')\big], \qquad
\hat{b}_i = \frac{1}{n}\sum_{t=1}^n \Big[\varphi_i(X_t) - \gamma\, \frac{\varphi_i(Y_t) + \varphi_i(Y_t')}{2}\Big]\, R_t.$$
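A minimal sketch of the double-sample BRM estimator (illustrative; assumes numpy, a feature map `phi(x)`, and a generative model `sim(x, a)` that can be queried twice per state to obtain the two independent next states):

```python
import numpy as np

def brm_double_sample(states, policy, sim, phi, gamma, d):
    """Estimate alpha by Bellman residual minimization with two next-state samples per state."""
    A_hat = np.zeros((d, d))
    b_hat = np.zeros(d)
    n = len(states)
    for x in states:                                    # X_t ~ mu
        a = policy(x)
        (y1, r), (y2, _) = sim(x, a), sim(x, a)         # two independent samples Y_t, Y_t'
        f, f1, f2 = phi(x), phi(y1), phi(y2)
        A_hat += np.outer(f - gamma * f1, f - gamma * f2) / n
        b_hat += (f - gamma * (f1 + f2) / 2) * r / n
    return np.linalg.solve(A_hat, b_hat)
```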

BRM: the approximation error
Proof. We relate the Bellman residual to the approximation error:
$$V^\pi - V = V^\pi - T^\pi V + T^\pi V - V = \gamma P^\pi (V^\pi - V) + T^\pi V - V
\ \Longrightarrow\ (I - \gamma P^\pi)(V^\pi - V) = T^\pi V - V.$$
Taking the norm on both sides we obtain
$$\|V^\pi - V_{BR}\| \le \|(I - \gamma P^\pi)^{-1}\|\,\|T^\pi V_{BR} - V_{BR}\|,$$
and
$$\|T^\pi V_{BR} - V_{BR}\| = \inf_{V \in \mathcal{F}} \|T^\pi V - V\| \le (1 + \gamma \|P^\pi\|) \inf_{V \in \mathcal{F}} \|V^\pi - V\|.$$

BRM: the approximation error
Proof (continued). If we consider the stationary distribution $\mu^\pi$, then $\|P^\pi\|_{\mu^\pi} = 1$. The matrix $(I - \gamma P^\pi)^{-1}$ can be written as the power series $\sum_t \gamma^t (P^\pi)^t$. Applying the norm we obtain
$$\|(I - \gamma P^\pi)^{-1}\|_{\mu^\pi} \le \sum_{t \ge 0} \gamma^t \|P^\pi\|_{\mu^\pi}^t \le \frac{1}{1-\gamma}.$$

LSTD vs BRM
Different assumptions: BRM requires a generative model, LSTD requires only a single trajectory.
The performance is evaluated differently: BRM under any distribution $\mu$, LSTD under the stationary distribution $\mu^\pi$.


Reinforcement Learning Alessandro Lazaric alessandro.lazaric@inria.fr sequel.lille.inria.fr