Approximate Dynamic Programming


A. LAZARIC (SequeL Team @ INRIA Lille)
École Centrale, Option DAD, EC-RL Course

Value Iteration: the Idea
1. Let $V_0$ be any vector in $\mathbb{R}^N$.
2. At each iteration $k = 1, 2, \ldots, K$: compute $V_{k+1} = T V_k$.
3. Return the greedy policy
$$\pi_K(x) \in \arg\max_{a \in A} \Big[ r(x,a) + \gamma \sum_y p(y|x,a)\, V_K(y) \Big].$$
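As a concrete illustration, here is a minimal tabular value-iteration sketch for a finite MDP. The array names `r[x, a]` and `p[x, a, y]` are illustrative assumptions, not part of the slides.

```python
import numpy as np

def value_iteration(r, p, gamma, K):
    """Tabular value iteration: V_{k+1} = T V_k.

    r: (N, A) rewards, p: (N, A, N) transition probabilities.
    Returns V_K and the greedy policy with respect to it.
    """
    N, A = r.shape
    V = np.zeros(N)                      # V_0 can be any vector in R^N
    for _ in range(K):
        Q = r + gamma * p @ V            # Q_k(x,a) = r(x,a) + gamma * sum_y p(y|x,a) V_k(y)
        V = Q.max(axis=1)                # V_{k+1} = T V_k
    policy = (r + gamma * p @ V).argmax(axis=1)  # greedy policy pi_K
    return V, policy
```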

Value Iteration: the Guarantees
From the fixed-point property of $T$: $\lim_{k\to\infty} V_k = V^*$.
From the contraction property of $T$:
$$\|V_{k+1} - V^*\|_\infty \le \gamma^{k+1}\, \|V_0 - V^*\|_\infty.$$
Problem: what if $V_{k+1} \neq T V_k$?

Policy Iteration: the Idea
1. Let $\pi_0$ be any stationary policy.
2. At each iteration $k = 1, 2, \ldots, K$:
   - Policy evaluation: given $\pi_k$, compute $V_k = V^{\pi_k}$.
   - Policy improvement: compute the greedy policy
     $$\pi_{k+1}(x) \in \arg\max_{a \in A} \Big[ r(x,a) + \gamma \sum_y p(y|x,a)\, V^{\pi_k}(y) \Big].$$
3. Return the last policy $\pi_K$.
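A minimal tabular policy-iteration sketch, under the same illustrative `r[x, a]` / `p[x, a, y]` conventions as the value-iteration example above; policy evaluation is done exactly by solving the linear system $(I - \gamma P^\pi) V = r^\pi$.

```python
import numpy as np

def policy_iteration(r, p, gamma, K):
    """Tabular policy iteration with exact policy evaluation."""
    N, A = r.shape
    policy = np.zeros(N, dtype=int)           # pi_0: an arbitrary stationary policy
    for _ in range(K):
        # Policy evaluation: V^pi solves (I - gamma * P^pi) V = r^pi
        P_pi = p[np.arange(N), policy]        # (N, N) transition matrix under pi
        r_pi = r[np.arange(N), policy]        # (N,) reward under pi
        V = np.linalg.solve(np.eye(N) - gamma * P_pi, r_pi)
        # Policy improvement: greedy policy w.r.t. V^{pi_k}
        new_policy = (r + gamma * p @ V).argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break                             # converged (finitely many policies)
        policy = new_policy
    return policy, V
```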

Policy Iteration: the Guarantees
The policy iteration algorithm generates a sequence of policies with non-decreasing performance, $V^{\pi_{k+1}} \ge V^{\pi_k}$, and it converges to $\pi^*$ in a finite number of iterations.
Problem: what if $V_k \neq V^{\pi_k}$?

Sources of Error
- Approximation error: if $X$ is large or continuous, the value functions $V$ cannot be represented correctly; use an approximation space $\mathcal{F}$.
- Estimation error: if the reward $r$ and dynamics $p$ are unknown, the Bellman operators $T$ and $T^\pi$ cannot be computed exactly; estimate the Bellman operators from samples.

Outline
- Performance Loss
- Approximate Value Iteration
- Approximate Policy Iteration

From Approximation Error to Performance Loss
Question: if $V$ is an approximation of the optimal value function $V^*$ with an error
$$\text{error} = \|V - V^*\|_\infty,$$
how does it translate into the (loss of) performance of the greedy policy
$$\pi(x) \in \arg\max_{a \in A} \sum_y p(y|x,a) \big[ r(x,a,y) + \gamma V(y) \big],$$
i.e. what is the performance loss $\|V^* - V^\pi\|_\infty$?

From Approximation Error to Performance Loss
Proposition
Let $V \in \mathbb{R}^N$ be an approximation of $V^*$ and $\pi$ its corresponding greedy policy. Then
$$\underbrace{\|V^* - V^\pi\|_\infty}_{\text{performance loss}} \;\le\; \frac{2\gamma}{1-\gamma}\, \underbrace{\|V^* - V\|_\infty}_{\text{approx. error}}.$$
Furthermore, there exists $\epsilon > 0$ such that if $\|V - V^*\|_\infty \le \epsilon$, then $\pi$ is optimal.
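A short proof sketch of the first inequality (a standard argument, added here for completeness): since $\pi$ is greedy with respect to $V$, $T^\pi V = T V$, and both operators are $\gamma$-contractions in $\|\cdot\|_\infty$.

```latex
\begin{align*}
\|V^* - V^\pi\|_\infty
  &= \|T V^* - T^\pi V^\pi\|_\infty \\
  &\le \|T V^* - T V\|_\infty + \|T^\pi V - T^\pi V^\pi\|_\infty
     && \text{(since $T V = T^\pi V$ for the greedy $\pi$)} \\
  &\le \gamma \|V^* - V\|_\infty + \gamma \|V - V^\pi\|_\infty \\
  &\le \gamma \|V^* - V\|_\infty
     + \gamma \big( \|V - V^*\|_\infty + \|V^* - V^\pi\|_\infty \big).
\end{align*}
Rearranging the last inequality gives
$(1-\gamma)\,\|V^* - V^\pi\|_\infty \le 2\gamma\,\|V^* - V\|_\infty$.
```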

From Approximation Error to Performance Loss
Question: how do we compute $V$?
Problem: unlike in standard approximation scenarios (see supervised learning), we only have limited access to the target function, i.e. $V^*$.
Objective: given an approximation space $\mathcal{F}$, compute an approximation $V$ which is as close as possible to the best approximation of $V^*$ in $\mathcal{F}$, i.e.
$$V \approx \arg\inf_{f \in \mathcal{F}} \|V^* - f\|.$$


Approximate Value Iteration: the Idea
Let $\mathcal{A}$ be an approximation operator.
1. Let $V_0$ be any vector in $\mathbb{R}^N$.
2. At each iteration $k = 1, 2, \ldots, K$: compute $V_{k+1} = \mathcal{A} T V_k$.
3. Return the greedy policy
$$\pi_K(x) \in \arg\max_{a \in A} \Big[ r(x,a) + \gamma \sum_y p(y|x,a)\, V_K(y) \Big].$$

Approximate Value Iteration: the Idea
Let $\mathcal{A} = \Pi_\infty$ be a projection operator in $L_\infty$-norm, which corresponds to
$$V_{k+1} = \Pi_\infty T V_k = \arg\inf_{V \in \mathcal{F}} \|T V_k - V\|_\infty.$$

Approximate Value Iteration: convergence
Proposition
The projection $\Pi_\infty$ is a non-expansion and the joint operator $\Pi_\infty T$ is a contraction. Then there exists a unique fixed point $\tilde{V} = \Pi_\infty T \tilde{V}$, which guarantees the convergence of AVI.

Approximate Value Iteration: performance loss
Proposition (Bertsekas & Tsitsiklis, 1996)
Let $V_K$ be the function returned by AVI after $K$ iterations and $\pi_K$ its corresponding greedy policy. Then
$$\|V^* - V^{\pi_K}\|_\infty \;\le\; \frac{2\gamma}{(1-\gamma)^2}\, \underbrace{\max_{0 \le k < K} \|T V_k - \mathcal{A} T V_k\|_\infty}_{\text{worst approx. error}} \;+\; \frac{2\gamma^{K+1}}{1-\gamma}\, \underbrace{\|V^* - V_0\|_\infty}_{\text{initial error}}.$$

Fitted Q-iteration with linear approximation
Assumption: access to a generative model: given a state $x$ and an action $a$, it returns a reward $r(x,a)$ and a next state $y \sim p(\cdot|x,a)$.
Idea: work with Q-functions and linear spaces.
$Q^*$ is the unique fixed point of $T$, defined over $X \times A$ as
$$T Q(x,a) = \sum_y p(y|x,a)\big[ r(x,a,y) + \gamma \max_b Q(y,b) \big].$$
$\mathcal{F}$ is a space defined by $d$ features $\varphi_1, \ldots, \varphi_d : X \times A \to \mathbb{R}$ as
$$\mathcal{F} = \Big\{ Q_\alpha(x,a) = \sum_{j=1}^d \alpha_j \varphi_j(x,a),\ \alpha \in \mathbb{R}^d \Big\}.$$
At each iteration, compute $Q_{k+1} = \Pi T Q_k$.

Fitted Q-iteration with linear approximation
At each iteration, compute $Q_{k+1} = \Pi T Q_k$.
Problems:
- the projection $\Pi$ cannot be computed efficiently;
- the Bellman operator $T$ is often unknown.

Fitted Q-iteration with linear approximation
Problem: the projection $\Pi$ cannot be computed efficiently.
Solution: let $\mu$ be a distribution over $X$, and use a projection in $L_{2,\mu}$-norm onto the space $\mathcal{F}$:
$$Q_{k+1} = \arg\min_{Q \in \mathcal{F}} \|Q - T Q_k\|_\mu^2.$$

Fitted Q-iteration with linear approximation
Problem: the Bellman operator $T$ is often unknown.
1. Sample $n$ state-action pairs $(X_i, A_i)$ with $X_i \sim \mu$ and $A_i$ random.
2. Simulate $Y_i \sim p(\cdot|X_i, A_i)$ and $R_i = r(X_i, A_i, Y_i)$ with the generative model.
3. Estimate $T Q_k(X_i, A_i)$ with
$$Z_i = R_i + \gamma \max_{a \in A} Q_k(Y_i, a)$$
(an unbiased estimate: $\mathbb{E}[Z_i \mid X_i, A_i] = T Q_k(X_i, A_i)$).

Fitted Q-iteration with linear approximation
At each iteration $k$, compute $Q_{k+1}$ as
$$Q_{k+1} = \arg\min_{Q_\alpha \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \big[ Q_\alpha(X_i, A_i) - Z_i \big]^2.$$
Since $Q_\alpha$ is a linear function of $\alpha$, this is a simple quadratic minimization problem with a closed-form solution.
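A minimal sketch of one fitted Q-iteration step with linear features, assuming a user-supplied feature map `phi(x, a)` returning a length-$d$ vector, a state sampler `mu_sampler()`, and a generative model `step(x, a) -> (reward, next_state)` (these names are illustrative, not from the slides).

```python
import numpy as np

def fitted_q_step(alpha, phi, step, mu_sampler, n_actions, gamma, n):
    """One iteration of fitted Q-iteration with linear features:
    Q_k(x, a) = phi(x, a) . alpha; fit Q_{k+1} to the sampled targets Z_i by least squares."""
    d = alpha.size
    Phi = np.zeros((n, d))
    Z = np.zeros(n)
    for i in range(n):
        X = mu_sampler()                               # X_i ~ mu
        A = np.random.randint(n_actions)               # A_i drawn at random
        R, Y = step(X, A)                              # generative model: reward and next state
        # Z_i = R_i + gamma * max_a Q_k(Y_i, a): unbiased estimate of T Q_k(X_i, A_i)
        Z[i] = R + gamma * max(phi(Y, a) @ alpha for a in range(n_actions))
        Phi[i] = phi(X, A)
    # closed-form solution of min_alpha (1/n) sum_i [Q_alpha(X_i, A_i) - Z_i]^2
    new_alpha, *_ = np.linalg.lstsq(Phi, Z, rcond=None)
    return new_alpha
```

Iterating this function $K$ times and then acting greedily with respect to $Q_{\alpha_K}$ gives the full algorithm.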

Other implementations
- K-nearest neighbour
- Regularized linear regression with $L_1$ or $L_2$ regularization
- Neural networks
- Support vector machines

Example: the Optimal Replacement Problem
- State: level of wear of an object (e.g., a car).
- Action: {(R)eplace, (K)eep}.
- Cost: $c(x, R) = C$; $c(x, K) = c(x)$, the maintenance cost plus extra costs.
- Dynamics: $p(\cdot|x, R) \sim \exp(\beta)$ with density $d(y) = \beta e^{-\beta y}\, \mathbb{1}\{y \ge 0\}$; $p(\cdot|x, K) \sim x + \exp(\beta)$ with density $d(y - x)$.
- Problem: minimize the discounted expected cost over an infinite horizon.

Example: the Optimal Replacement Problem
Optimal value function:
$$V^*(x) = \min\Big\{ c(x) + \gamma \int_0^\infty d(y - x)\, V^*(y)\, dy,\;\; C + \gamma \int_0^\infty d(y)\, V^*(y)\, dy \Big\}.$$
Optimal policy: the action that attains the minimum.
[Figure: the management cost $c(x)$ and the optimal value function $V^*(x)$ as functions of the wear $x \in [0, 10]$, with the induced Keep/Replace regions.]
Linear approximation space:
$$\mathcal{F} := \Big\{ V_\alpha(x) = \sum_{k=1}^{20} \alpha_k \cos\Big(k\pi \frac{x}{x_{\max}}\Big) \Big\}.$$

Example: the Optimal Replacement Problem
Collect $N$ samples on a uniform grid.
[Figure: the sampled states on a uniform grid over the wear level.]
[Figure: left, the target values $\{T V_0(x_n)\}_{1 \le n \le N}$; right, the approximation $V_1 \in \mathcal{F}$ of the target function $T V_0$.]
[Figure: left, the target values $\{T V_1(x_n)\}_{1 \le n \le N}$; center, the approximation $V_2 \in \mathcal{F}$ of $T V_1$; right, the approximation $V_n \in \mathcal{F}$ after $n$ iterations.]
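A sketch of one fitted value-iteration step for this example: evaluate the Bellman targets $T V_k(x_n)$ on the grid and refit the cosine features by least squares. The maintenance-cost function `cost(x)`, the replacement cost `C`, the wear parameter `beta`, the Monte Carlo integration of the exponential noise, and the clipping at `x_max` are all illustrative assumptions, not details from the slides.

```python
import numpy as np

def cosine_features(x, x_max=10.0, d=20):
    """Feature map phi_k(x) = cos(k * pi * x / x_max), k = 1..d, for an array of states x."""
    k = np.arange(1, d + 1)
    return np.cos(k * np.pi * np.asarray(x, dtype=float)[:, None] / x_max)

def fitted_vi_step(alpha, grid, cost, C, beta, gamma, x_max=10.0, n_mc=1000, rng=None):
    """One step of fitted value iteration for the replacement problem."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.exponential(1.0 / beta, size=n_mc)        # wear increments with density beta*exp(-beta*y)
    # V_alpha evaluated on an array of states; states beyond x_max are clipped (a simplification)
    V = lambda x: cosine_features(np.clip(x, 0.0, x_max), x_max) @ alpha
    targets = np.empty(len(grid))
    for i, x in enumerate(grid):
        keep = cost(x) + gamma * V(x + noise).mean()      # (K)eep: wear grows by an exponential increment
        replace = C + gamma * V(noise).mean()             # (R)eplace: wear restarts from an exponential draw
        targets[i] = min(keep, replace)                   # Monte Carlo estimate of T V_k(x_i)
    # least-squares fit of the cosine features to the targets
    new_alpha, *_ = np.linalg.lstsq(cosine_features(grid, x_max), targets, rcond=None)
    return new_alpha
```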

Outline
- Performance Loss
- Approximate Value Iteration
- Approximate Policy Iteration
  - Linear Temporal-Difference
  - Least-Squares Temporal Difference
  - Bellman Residual Minimization

Approximate Policy Iteration: the Idea
Let $\mathcal{A}$ be an approximation operator.
- Policy evaluation: given the current policy $\pi_k$, compute $V_k = \mathcal{A} V^{\pi_k}$.
- Policy improvement: given the approximated value of the current policy, compute the greedy policy w.r.t. $V_k$ as
$$\pi_{k+1}(x) \in \arg\max_{a \in A} \Big[ r(x,a) + \gamma \sum_{y \in X} p(y|x,a)\, V_k(y) \Big].$$
Problem: the algorithm is no longer guaranteed to converge.
[Figure: $\|V^* - V^{\pi_k}\|$ as a function of $k$: the error does not vanish but settles within an asymptotic error band.]

Approximate Policy Iteration: performance loss
Proposition
The asymptotic performance of the policies $\pi_k$ generated by the API algorithm is related to the approximation error as
$$\limsup_{k\to\infty}\, \underbrace{\|V^* - V^{\pi_k}\|_\infty}_{\text{performance loss}} \;\le\; \frac{2\gamma}{(1-\gamma)^2}\, \limsup_{k\to\infty}\, \underbrace{\|V_k - V^{\pi_k}\|_\infty}_{\text{approximation error}}.$$


Linear TD(λ): the algorithm
Algorithm
Given a linear space $\mathcal{F} = \{V_\alpha(x) = \sum_{i=1}^d \alpha_i \varphi_i(x),\ \alpha \in \mathbb{R}^d\}$, the trace vector $z \in \mathbb{R}^d$ and the parameter vector $\alpha \in \mathbb{R}^d$ are initialized to zero. Generate a sequence of states $(x_0, x_1, x_2, \ldots)$ according to $\pi$. At each step $t$, the temporal difference is
$$d_t = r(x_t, \pi(x_t)) + \gamma V_{\alpha_t}(x_{t+1}) - V_{\alpha_t}(x_t),$$
and the parameters are updated as
$$\alpha_{t+1} = \alpha_t + \eta_t\, d_t\, z_t, \qquad z_{t+1} = \lambda\gamma\, z_t + \varphi(x_{t+1}),$$
where $\eta_t$ is the learning step.
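A minimal sketch of this update loop, assuming a feature map `phi(x)` returning a length-$d$ vector, a fixed policy `pi`, an environment `env_step(x, a) -> (reward, next_state)`, and a step-size schedule `eta(t)` (illustrative names, not from the slides). The trace update follows the common convention $z_t = \lambda\gamma z_{t-1} + \varphi(x_t)$, whose indexing differs from the slide's by one step.

```python
import numpy as np

def linear_td_lambda(phi, pi, env_step, x0, d, gamma, lam, eta, T):
    """Linear TD(lambda) evaluation of policy pi: V_alpha(x) = phi(x) . alpha."""
    alpha = np.zeros(d)                       # parameter vector, initialized to zero
    z = np.zeros(d)                           # eligibility trace, initialized to zero
    x = x0
    for t in range(T):
        r, x_next = env_step(x, pi(x))        # follow pi, observe reward and next state
        # temporal difference d_t = r + gamma * V(x_{t+1}) - V(x_t)
        delta = r + gamma * phi(x_next) @ alpha - phi(x) @ alpha
        z = lam * gamma * z + phi(x)          # accumulate the eligibility trace
        alpha = alpha + eta(t) * delta * z    # alpha_{t+1} = alpha_t + eta_t * d_t * z_t
        x = x_next
    return alpha
```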

Linear TD(λ): approximation error
Proposition (Tsitsiklis & Van Roy, 1996)
Let the learning rate $\eta_t$ satisfy $\sum_{t \ge 0} \eta_t = \infty$ and $\sum_{t \ge 0} \eta_t^2 < \infty$. Assume that $\pi$ admits a stationary distribution $\mu^\pi$ and that the features $(\varphi_i)_{1 \le i \le d}$ are linearly independent. Then there exists a fixed $\alpha^*$ such that $\lim_{t\to\infty} \alpha_t = \alpha^*$. Furthermore,
$$\underbrace{\|V_{\alpha^*} - V^\pi\|_{2,\mu^\pi}}_{\text{approximation error}} \;\le\; \frac{1 - \lambda\gamma}{1 - \gamma}\, \underbrace{\inf_\alpha \|V_\alpha - V^\pi\|_{2,\mu^\pi}}_{\text{smallest approximation error}}.$$

Linear TD(λ): approximation error
Remark: for $\lambda = 1$ we recover Monte-Carlo (i.e., TD(1)) and the bound is the smallest.
Problem: the bound does not consider the variance (i.e., the number of samples needed for $\alpha_t$ to converge to $\alpha^*$).

Linear TD(λ): implementation
Pros: simple to implement; computational cost linear in $d$.
Cons: very sample-inefficient; many samples are needed to converge.


Least-squares TD: the algorithm
Recall: $V^\pi = T^\pi V^\pi$.
Intuition: compute $V = \mathcal{A} T^\pi V$, i.e. look for the fixed point $V_{TD} = \Pi_\mu T^\pi V_{TD}$.
[Figure: geometric view of $V^\pi$, the space $\mathcal{F}$, and the projected fixed point $V_{TD} = \Pi_\mu T^\pi V_{TD}$.]
Focus on the $L_{2,\mu}$-weighted norm and the projection $\Pi_\mu$:
$$\Pi_\mu g = \arg\min_{f \in \mathcal{F}} \|f - g\|_\mu.$$

Least-squares TD: the algorithm
By construction, the Bellman residual of $V_{TD}$ is orthogonal to $\mathcal{F}$; thus, for any $1 \le i \le d$,
$$\langle T^\pi V_{TD} - V_{TD},\ \varphi_i \rangle_\mu = 0,$$
and
$$\langle r^\pi + \gamma P^\pi V_{TD} - V_{TD},\ \varphi_i \rangle_\mu = 0
\;\Longrightarrow\;
\langle r^\pi, \varphi_i \rangle_\mu + \sum_{j=1}^d \langle \gamma P^\pi \varphi_j - \varphi_j,\ \varphi_i \rangle_\mu\, \alpha_{TD,j} = 0,$$
so $\alpha_{TD}$ is the solution of a linear system of order $d$.

Least-squares TD: the algorithm
Algorithm
The LSTD solution $\alpha_{TD}$ is obtained by computing the matrix $A$ and vector $b$ defined as
$$A_{i,j} = \langle \varphi_i,\ \varphi_j - \gamma P^\pi \varphi_j \rangle_\mu, \qquad b_i = \langle \varphi_i,\ r^\pi \rangle_\mu,$$
and then solving the system $A\alpha = b$.

Least-squares TD: the approximation error
Problem: in general, $\Pi_\mu T^\pi$ does not admit a fixed point (i.e., the matrix $A$ is not invertible).
Solution: use the stationary distribution $\mu^\pi$ of policy $\pi$, that is, the distribution satisfying $\mu^\pi P^\pi = \mu^\pi$, i.e.
$$\mu^\pi(y) = \sum_x p(y|x, \pi(x))\, \mu^\pi(x).$$

Least-squares TD: the approximation error
Proposition
The Bellman operator $T^\pi$ is a contraction in the weighted $L_{2,\mu^\pi}$-norm. Thus the joint operator $\Pi_{\mu^\pi} T^\pi$ is a contraction and admits a unique fixed point $V_{TD}$. Then
$$\underbrace{\|V^\pi - V_{TD}\|_{\mu^\pi}}_{\text{approximation error}} \;\le\; \frac{1}{\sqrt{1-\gamma^2}}\, \underbrace{\inf_{V \in \mathcal{F}} \|V^\pi - V\|_{\mu^\pi}}_{\text{smallest approximation error}}.$$

Least-squares TD: the implementation
Generate $(X_0, X_1, \ldots)$ by directly executing $\pi$ and observe $R_t = r(X_t, \pi(X_t))$.
Compute the estimates
$$\hat{A}_{ij} = \frac{1}{n} \sum_{t=1}^n \varphi_i(X_t)\big[\varphi_j(X_t) - \gamma \varphi_j(X_{t+1})\big], \qquad
\hat{b}_i = \frac{1}{n} \sum_{t=1}^n \varphi_i(X_t)\, R_t.$$
Solve $\hat{A}\alpha = \hat{b}$.
Remark: no generative model is needed; if the chain is ergodic, $\hat{A} \to A$ and $\hat{b} \to b$ as $n \to \infty$.
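A minimal sketch of these estimates from a single trajectory, assuming a feature map `phi(x)` returning a length-$d$ vector and arrays `states` (length $n+1$) and `rewards` (length $n$) collected by executing $\pi$ (illustrative names).

```python
import numpy as np

def lstd(phi, states, rewards, gamma, d):
    """LSTD: estimate A_hat and b_hat from one trajectory and solve A_hat alpha = b_hat."""
    n = len(rewards)
    A_hat = np.zeros((d, d))
    b_hat = np.zeros(d)
    for t in range(n):
        f_t, f_next = phi(states[t]), phi(states[t + 1])
        A_hat += np.outer(f_t, f_t - gamma * f_next)   # phi_i(X_t)[phi_j(X_t) - gamma phi_j(X_{t+1})]
        b_hat += f_t * rewards[t]                      # phi_i(X_t) R_t
    A_hat /= n
    b_hat /= n
    return np.linalg.solve(A_hat, b_hat)               # LSTD weights (assumes A_hat is invertible)
```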


Bellman Residual Minimization (BRM): the idea
[Figure: geometric view of $V^\pi$, the space $\mathcal{F}$, and $V_{BR}$, the element of $\mathcal{F}$ minimizing the Bellman residual $\|T^\pi V - V\|$.]
Let $\mu$ be a distribution over $X$; $V_{BR}$ is the minimizer of the Bellman residual w.r.t. $T^\pi$:
$$V_{BR} = \arg\min_{V \in \mathcal{F}} \|T^\pi V - V\|_{2,\mu}.$$

Bellman Residual Minimization (BRM): the idea
The mapping $\alpha \mapsto T^\pi V_\alpha - V_\alpha$ is affine, so the function $\alpha \mapsto \|T^\pi V_\alpha - V_\alpha\|_\mu^2$ is quadratic. The minimum is obtained by computing the gradient and setting it to zero:
$$\Big\langle r^\pi + (\gamma P^\pi - I) \sum_{j=1}^d \varphi_j \alpha_j,\ (\gamma P^\pi - I)\varphi_i \Big\rangle_\mu = 0,$$
which can be rewritten as $A\alpha = b$, with
$$A_{i,j} = \langle \varphi_i - \gamma P^\pi \varphi_i,\ \varphi_j - \gamma P^\pi \varphi_j \rangle_\mu, \qquad
b_i = \langle \varphi_i - \gamma P^\pi \varphi_i,\ r^\pi \rangle_\mu.$$

Bellman Residual Minimization (BRM): the idea
Remark: the system admits a solution whenever the features $\varphi_i$ are linearly independent w.r.t. $\mu$.
Remark: let $\psi_i = \varphi_i - \gamma P^\pi \varphi_i$, $i = 1, \ldots, d$; then the previous system can be interpreted as a linear regression problem, fitting $\alpha^\top \psi$ to $r^\pi$ in $L_{2,\mu}$.

BRM: the approximation error
Proposition
We have
$$\|V^\pi - V_{BR}\| \;\le\; \|(I - \gamma P^\pi)^{-1}\|\,\big(1 + \gamma \|P^\pi\|\big)\, \inf_{V \in \mathcal{F}} \|V^\pi - V\|.$$
If $\mu^\pi$ is the stationary distribution of $\pi$, then $\|P^\pi\|_{\mu^\pi} = 1$ and $\|(I - \gamma P^\pi)^{-1}\|_{\mu^\pi} = \frac{1}{1-\gamma}$, thus
$$\|V^\pi - V_{BR}\|_{\mu^\pi} \;\le\; \frac{1+\gamma}{1-\gamma}\, \inf_{V \in \mathcal{F}} \|V^\pi - V\|_{\mu^\pi}.$$

BRM: the implementation
Assumption: a generative model is available.
- Draw $n$ states $X_t \sim \mu$.
- Call the generative model on $(X_t, A_t)$ (with $A_t = \pi(X_t)$) and obtain $R_t = r(X_t, A_t)$ and $Y_t \sim p(\cdot|X_t, A_t)$.
- Compute
$$\hat{B}(V) = \frac{1}{n} \sum_{t=1}^n \Big[ V(X_t) - \underbrace{\big( R_t + \gamma V(Y_t) \big)}_{\hat{T} V(X_t)} \Big]^2.$$

BRM: the implementation
Problem: this estimator is biased and not consistent! In fact,
$$\mathbb{E}[\hat{B}(V)] = \mathbb{E}\Big[ \big[ V(X_t) - T^\pi V(X_t) + T^\pi V(X_t) - \hat{T} V(X_t) \big]^2 \Big]
= \|T^\pi V - V\|_\mu^2 + \mathbb{E}\Big[ \big[ T^\pi V(X_t) - \hat{T} V(X_t) \big]^2 \Big],$$
so minimizing $\hat{B}(V)$ does not correspond to minimizing $B(V)$ (even when $n \to \infty$).

BRM: the implementation
Solution: in each state $X_t$, generate two independent samples $Y_t$ and $Y'_t \sim p(\cdot|X_t, A_t)$, and define
$$\hat{B}(V) = \frac{1}{n} \sum_{t=1}^n \big[ V(X_t) - \big( R_t + \gamma V(Y_t) \big) \big]\,\big[ V(X_t) - \big( R_t + \gamma V(Y'_t) \big) \big].$$
Then $\hat{B} \to B$ as $n \to \infty$.

BRM: the implementation
The function $\alpha \mapsto \hat{B}(V_\alpha)$ is quadratic, and we obtain the linear system $\hat{A}\alpha = \hat{b}$ with
$$\hat{A}_{i,j} = \frac{1}{n} \sum_{t=1}^n \big[ \varphi_i(X_t) - \gamma \varphi_i(Y_t) \big]\big[ \varphi_j(X_t) - \gamma \varphi_j(Y'_t) \big], \qquad
\hat{b}_i = \frac{1}{n} \sum_{t=1}^n \Big[ \varphi_i(X_t) - \gamma\, \frac{\varphi_i(Y_t) + \varphi_i(Y'_t)}{2} \Big] R_t.$$
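A minimal sketch of the double-sample BRM estimates, assuming a feature map `phi(x)`, a policy `pi`, a state sampler `mu_sampler()`, and a generative model `step(x, a) -> (reward, next_state)` that can be called twice per state to obtain the two independent next states (illustrative names).

```python
import numpy as np

def brm(phi, pi, mu_sampler, step, gamma, d, n):
    """Bellman Residual Minimization with double sampling: solve A_hat alpha = b_hat."""
    A_hat = np.zeros((d, d))
    b_hat = np.zeros(d)
    for _ in range(n):
        X = mu_sampler()
        A = pi(X)
        R, Y = step(X, A)            # first independent next state (and the reward R_t)
        _, Y2 = step(X, A)           # second independent next state; its reward is discarded
        u = phi(X) - gamma * phi(Y)
        v = phi(X) - gamma * phi(Y2)
        A_hat += np.outer(u, v)                                  # [phi(X) - g*phi(Y)][phi(X) - g*phi(Y')]^T
        b_hat += (phi(X) - gamma * (phi(Y) + phi(Y2)) / 2) * R   # [phi(X) - g*(phi(Y)+phi(Y'))/2] R_t
    A_hat /= n
    b_hat /= n
    return np.linalg.solve(A_hat, b_hat)                         # BRM weights (assumes invertibility)
```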

LSTD vs BRM
- Different assumptions: BRM requires a generative model; LSTD only requires a single trajectory.
- The performance is evaluated differently: BRM w.r.t. any distribution $\mu$; LSTD w.r.t. the stationary distribution $\mu^\pi$.


Reinforcement Learning
Alessandro Lazaric
alessandro.lazaric@inria.fr
sequel.lille.inria.fr