Approximate Dynamic Programming
A. Lazaric (SequeL Team @ INRIA-Lille)
École Centrale, Option DAD, EC-RL Course
Value Iteration: the Idea

1. Let $V_0$ be any vector in $\mathbb{R}^N$.
2. At each iteration $k = 1, 2, \ldots, K$: compute $V_{k+1} = T V_k$.
3. Return the greedy policy
$$\pi_K(x) \in \arg\max_{a \in A} \Big[ r(x,a) + \gamma \sum_y p(y|x,a)\, V_K(y) \Big].$$
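As a concrete reference, here is a minimal NumPy sketch of this loop for a finite MDP; the array names and shapes (`r` of shape (N, A), `P` of shape (N, A, N)) are illustrative conventions, not from the slides:

```python
import numpy as np

def value_iteration(r, P, gamma, K):
    """Value iteration on a finite MDP.

    r: (N, A) array of rewards r(x, a)
    P: (N, A, N) array of transition probabilities p(y | x, a)
    Returns V_K and the greedy policy pi_K.
    """
    N, A = r.shape
    V = np.zeros(N)                            # V_0: any vector in R^N
    for _ in range(K):
        # Bellman operator: (T V)(x) = max_a [r(x, a) + gamma * sum_y p(y|x,a) V(y)]
        V = np.max(r + gamma * P @ V, axis=1)
    pi = np.argmax(r + gamma * P @ V, axis=1)  # greedy policy w.r.t. V_K
    return V, pi
```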
Value Iteration: the Guarantees

From the fixed point property of $T$: $\lim_{k \to \infty} V_k = V^*$.

From the contraction property of $T$:
$$\|V_{k+1} - V^*\|_\infty \le \gamma^{k+1} \|V_0 - V^*\|_\infty.$$

Problem: what if $V_{k+1} \neq T V_k$?
Policy Iteration: the Idea

1. Let $\pi_0$ be any stationary policy.
2. At each iteration $k = 1, 2, \ldots, K$:
   - Policy evaluation: given $\pi_k$, compute $V_k = V^{\pi_k}$.
   - Policy improvement: compute the greedy policy
     $$\pi_{k+1}(x) \in \arg\max_{a \in A} \Big[ r(x,a) + \gamma \sum_y p(y|x,a)\, V^{\pi_k}(y) \Big].$$
3. Return the last policy $\pi_K$.
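A matching sketch under the same illustrative conventions (`r` of shape (N, A), `P` of shape (N, A, N)); the exact policy evaluation step solves the linear system $(I - \gamma P^\pi)V = r^\pi$:

```python
import numpy as np

def policy_iteration(r, P, gamma, K):
    """Policy iteration on a finite MDP (same conventions as the sketch above)."""
    N, A = r.shape
    pi = np.zeros(N, dtype=int)              # pi_0: any stationary policy
    for _ in range(K):
        # Policy evaluation: V^pi solves (I - gamma * P^pi) V = r^pi
        P_pi = P[np.arange(N), pi]           # (N, N): transitions under pi
        r_pi = r[np.arange(N), pi]           # (N,): rewards under pi
        V = np.linalg.solve(np.eye(N) - gamma * P_pi, r_pi)
        # Policy improvement: greedy policy w.r.t. V^{pi_k}
        pi_next = np.argmax(r + gamma * P @ V, axis=1)
        if np.array_equal(pi_next, pi):      # no change: pi is optimal
            break
        pi = pi_next
    return pi
```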
Policy Iteration: the Guarantees

The policy iteration algorithm generates a sequence of policies with non-decreasing performance, $V^{\pi_{k+1}} \ge V^{\pi_k}$, and it converges to $\pi^*$ in a finite number of iterations.

Problem: what if $V_k \neq V^{\pi_k}$?
Sources of Error

Approximation error: if $X$ is large or continuous, value functions $V$ cannot be represented exactly $\Rightarrow$ use an approximation space $F$.

Estimation error: if the reward $r$ and dynamics $p$ are unknown, the Bellman operators $T$ and $T^\pi$ cannot be computed exactly $\Rightarrow$ estimate the Bellman operators from samples.
Outline

- Performance Loss
- Approximate Value Iteration
- Approximate Policy Iteration

Performance Loss
From Approximation Error to Performance Loss

Question: if $V$ is an approximation of the optimal value function $V^*$ with an error
$$\text{error} = \|V - V^*\|_\infty,$$
how does it translate to the (loss of) performance of the greedy policy
$$\pi(x) \in \arg\max_{a \in A} \sum_y p(y|x,a) \big[ r(x,a,y) + \gamma V(y) \big],$$
i.e., what is
$$\text{performance loss} = \|V^* - V^\pi\|_\infty ?$$
Proposition

Let $V \in \mathbb{R}^N$ be an approximation of $V^*$ and $\pi$ its corresponding greedy policy. Then
$$\underbrace{\|V^* - V^\pi\|_\infty}_{\text{performance loss}} \le \frac{2\gamma}{1-\gamma} \underbrace{\|V^* - V\|_\infty}_{\text{approx. error}}.$$
Furthermore, there exists $\epsilon > 0$ such that if $\|V - V^*\|_\infty \le \epsilon$, then $\pi$ is optimal.
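To gauge the magnitude of this bound, here is a worked instance with illustrative numbers (not from the lecture):

```latex
\text{With } \gamma = 0.9 \text{ and } \|V^* - V\|_\infty = 0.1:
\qquad
\|V^* - V^\pi\|_\infty
\;\le\; \frac{2 \cdot 0.9}{1 - 0.9} \cdot 0.1
\;=\; 18 \cdot 0.1
\;=\; 1.8 .
```

The $\frac{1}{1-\gamma}$ factor means that as the discount factor approaches 1, even a small approximation error can translate into a large performance loss.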
Question: how do we compute $V$?

Problem: unlike in standard approximation scenarios (cf. supervised learning), we have only limited access to the target function $V^*$.

Objective: given an approximation space $F$, compute an approximation $V$ which is as close as possible to the best approximation of $V^*$ in $F$, i.e., to
$$\arg\inf_{f \in F} \|V^* - f\|.$$
Approximate Value Iteration
Approximate Value Iteration: the Idea

Let $\mathcal{A}$ be an approximation operator.

1. Let $V_0$ be any vector in $\mathbb{R}^N$.
2. At each iteration $k = 1, 2, \ldots, K$: compute $V_{k+1} = \mathcal{A} T V_k$.
3. Return the greedy policy
$$\pi_K(x) \in \arg\max_{a \in A} \Big[ r(x,a) + \gamma \sum_y p(y|x,a)\, V_K(y) \Big].$$
Let $\mathcal{A} = \Pi_\infty$ be a projection operator in $L_\infty$-norm onto $F$, which corresponds to
$$V_{k+1} = \Pi_\infty T V_k = \arg\inf_{V \in F} \|T V_k - V\|_\infty.$$
Approximate Value Iteration: convergence

Proposition

The projection $\Pi_\infty$ is a non-expansion and the joint operator $\Pi_\infty T$ is a contraction. Then there exists a unique fixed point $\tilde{V} = \Pi_\infty T \tilde{V}$, which guarantees the convergence of AVI.
Approximate Value Iteration: performance loss

Proposition (Bertsekas & Tsitsiklis, 1996)

Let $V_K$ be the function returned by AVI after $K$ iterations and $\pi_K$ its corresponding greedy policy. Then
$$\|V^* - V^{\pi_K}\|_\infty \le \frac{2\gamma}{(1-\gamma)^2} \underbrace{\max_{0 \le k < K} \|T V_k - \mathcal{A} T V_k\|_\infty}_{\text{worst approx. error}} + \frac{2\gamma^{K+1}}{1-\gamma} \underbrace{\|V^* - V_0\|_\infty}_{\text{initial error}}.$$
Fitted Q-iteration with linear approximation

Assumption: access to a generative model: given a state $x$ and an action $a$, the model returns a reward $r(x,a)$ and a next state $y \sim p(\cdot|x,a)$.

Idea: work with Q-functions and linear spaces.

$Q^*$ is the unique fixed point of $T$, defined over $X \times A$ as
$$T Q(x,a) = \sum_y p(y|x,a) \big[ r(x,a,y) + \gamma \max_b Q(y,b) \big].$$

$F$ is a space defined by $d$ features $\phi_1, \ldots, \phi_d : X \times A \to \mathbb{R}$ as
$$F = \Big\{ Q_\alpha(x,a) = \sum_{j=1}^d \alpha_j \phi_j(x,a),\ \alpha \in \mathbb{R}^d \Big\}.$$

At each iteration compute $Q_{k+1} = \Pi T Q_k$.
Problems:
- the $\Pi$ operator cannot be computed efficiently;
- the Bellman operator $T$ is often unknown.
Problem: the $\Pi$ operator cannot be computed efficiently.

Solution: let $\mu$ be a distribution over $X$ and use a projection in $L_{2,\mu}$-norm onto the space $F$:
$$Q_{k+1} = \arg\min_{Q \in F} \|Q - T Q_k\|^2_\mu.$$
Problem: the Bellman operator $T$ is often unknown. Solution:

1. Sample $n$ state-action pairs $(X_i, A_i)$ with $X_i \sim \mu$ and $A_i$ random.
2. Simulate $Y_i \sim p(\cdot|X_i, A_i)$ and $R_i = r(X_i, A_i, Y_i)$ with the generative model.
3. Estimate $T Q_k(X_i, A_i)$ with $Z_i = R_i + \gamma \max_{a \in A} Q_k(Y_i, a)$ (unbiased: $\mathbb{E}[Z_i \mid X_i, A_i] = T Q_k(X_i, A_i)$).
At each iteration $k$, compute $Q_{k+1}$ as
$$Q_{k+1} = \arg\min_{Q_\alpha \in F} \frac{1}{n} \sum_{i=1}^n \big[ Q_\alpha(X_i, A_i) - Z_i \big]^2.$$
Since $Q_\alpha$ is linear in $\alpha$, this is a simple quadratic minimization problem with a closed-form solution.
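A minimal sketch of one such iteration, assuming a hypothetical feature map `phi(x, a)` returning a vector in $\mathbb{R}^d$ and a generative-model stub `sample_model(x, a, rng)` returning a reward and a next state (both names are placeholders):

```python
import numpy as np

def fqi_step(alpha, phi, sample_model, X, A, actions, gamma, rng):
    """One fitted Q-iteration step: least-squares regression of Z_i on phi(X_i, A_i).

    alpha: parameters of Q_k, i.e. Q_k(x, a) = phi(x, a) @ alpha
    X, A: the n sampled state-action pairs (X_i, A_i), with X_i ~ mu
    """
    Phi, Z = [], []
    for x, a in zip(X, A):
        R, Y = sample_model(x, a, rng)       # simulate Y_i ~ p(.|X_i, A_i) and R_i
        # Z_i = R_i + gamma * max_b Q_k(Y_i, b): unbiased estimate of T Q_k(X_i, A_i)
        Z.append(R + gamma * max(phi(Y, b) @ alpha for b in actions))
        Phi.append(phi(x, a))
    # Quadratic problem in alpha: closed-form least-squares solution
    alpha_next, *_ = np.linalg.lstsq(np.array(Phi), np.array(Z), rcond=None)
    return alpha_next
```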
Other implementations of the regression step:
- K-nearest neighbors
- Regularized linear regression with $L_1$ or $L_2$ regularization
- Neural networks
- Support vector machines
Example: the Optimal Replacement Problem

State: level of wear of an object (e.g., a car).
Action: $\{(R)\text{eplace}, (K)\text{eep}\}$.
Cost: $c(x, R) = C$; $c(x, K) = c(x)$, the maintenance plus extra costs.
Dynamics: $p(\cdot|x, R) = \text{Exp}(\beta)$ with density $d(y) = \beta \exp(-\beta y)\, \mathbb{I}\{y \ge 0\}$; $p(\cdot|x, K) = x + \text{Exp}(\beta)$ with density $d(y - x)$.
Problem: minimize the discounted expected cost over an infinite horizon.
Optimal value function:
$$V^*(x) = \min\Big\{ c(x) + \gamma \int_0^\infty d(y - x)\, V^*(y)\, dy,\;\; C + \gamma \int_0^\infty d(y)\, V^*(y)\, dy \Big\}.$$
Optimal policy: the action that attains the minimum.

[Figure: the management cost (left) and the optimal value function (right) as functions of the wear $x \in [0, 10]$; the optimal policy alternates between (K)eep and (R)eplace regions.]

Linear approximation space:
$$F := \Big\{ V_\alpha(x) = \sum_{k=1}^{20} \alpha_k \cos\Big(k\pi \frac{x}{x_{\max}}\Big) \Big\}.$$
Collect $N$ samples on a uniform grid.

[Figure: left, the target values computed as $\{T V_0(x_n)\}_{1 \le n \le N}$; right, the approximation $V_1 \in F$ of the target function $T V_0$.]
[Figure: left, the target values computed as $\{T V_1(x_n)\}_{1 \le n \le N}$; center, the approximation $V_2 \in F$ of $T V_1$; right, the approximation $V_n \in F$ after $n$ iterations.]
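A sketch of this experiment as fitted value iteration with the cosine features; the maintenance cost `c`, the numeric constants, and the truncation of the integrals at $x_{\max}$ are illustrative choices, not the lecture's exact setup:

```python
import numpy as np

gamma, beta, C, x_max, d = 0.6, 0.6, 50.0, 10.0, 20   # illustrative constants
c = lambda x: 4.0 * x                                  # placeholder maintenance cost

def features(x):
    """phi_k(x) = cos(k * pi * x / x_max), k = 1, ..., 20."""
    return np.cos(np.arange(1, d + 1) * np.pi * x / x_max)

def bellman(x, V, grid):
    """min{ c(x) + gamma * int d(y-x) V(y) dy,  C + gamma * int d(y) V(y) dy },
    with the integrals truncated at x_max and evaluated on the grid."""
    dx = grid[1] - grid[0]
    dens_keep = np.where(grid >= x, beta * np.exp(-beta * (grid - x)), 0.0)
    dens_repl = beta * np.exp(-beta * grid)
    keep = c(x) + gamma * np.sum(dens_keep * V) * dx
    replace = C + gamma * np.sum(dens_repl * V) * dx
    return min(keep, replace)

grid = np.linspace(0.0, x_max, 200)                    # uniform grid of N states
Phi = np.array([features(x) for x in grid])
alpha = np.zeros(d)                                    # V_0 = 0
for _ in range(50):                                    # fitted value iteration
    targets = np.array([bellman(x, Phi @ alpha, grid) for x in grid])  # {T V_k(x_n)}
    alpha, *_ = np.linalg.lstsq(Phi, targets, rcond=None)              # V_{k+1} in F
```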
Approximate Policy Iteration

(Subsections: Linear Temporal-Difference, Least-Squares Temporal Difference, Bellman Residual Minimization)
Approximate Policy Iteration: the Idea

Let $\mathcal{A}$ be an approximation operator.

Policy evaluation: given the current policy $\pi_k$, compute $V_k = \mathcal{A} V^{\pi_k}$.

Policy improvement: given the approximated value of the current policy, compute the greedy policy w.r.t. $V_k$ as
$$\pi_{k+1}(x) \in \arg\max_{a \in A} \Big[ r(x,a) + \gamma \sum_{y \in X} p(y|x,a)\, V_k(y) \Big].$$

Problem: the algorithm is no longer guaranteed to converge.

[Figure: $\|V^* - V^{\pi_k}\|$ as a function of $k$, oscillating above an asymptotic error level.]
Approximate Policy Iteration: performance loss

Proposition

The asymptotic performance of the policies $\pi_k$ generated by the API algorithm is related to the approximation error as:
$$\limsup_{k \to \infty} \underbrace{\|V^* - V^{\pi_k}\|_\infty}_{\text{performance loss}} \le \frac{2\gamma}{(1-\gamma)^2} \limsup_{k \to \infty} \underbrace{\|V_k - V^{\pi_k}\|_\infty}_{\text{approximation error}}.$$
Linear Temporal-Difference
Linear TD(λ): the algorithm

Given a linear space $F = \{ V_\alpha(x) = \sum_{i=1}^d \alpha_i \phi_i(x),\ \alpha \in \mathbb{R}^d \}$:

- Initialize the trace vector $z \in \mathbb{R}^d$ and the parameter vector $\alpha \in \mathbb{R}^d$ to zero.
- Generate a sequence of states $(x_0, x_1, x_2, \ldots)$ according to $\pi$.
- At each step $t$, the temporal difference is
  $$d_t = r(x_t, \pi(x_t)) + \gamma V_{\alpha_t}(x_{t+1}) - V_{\alpha_t}(x_t)$$
  and the parameters are updated as
  $$\alpha_{t+1} = \alpha_t + \eta_t d_t z_t, \qquad z_{t+1} = \lambda\gamma z_t + \phi(x_{t+1}),$$
  where $\eta_t$ is the learning step.
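A minimal sketch of these updates along a single trajectory, with hypothetical stubs `phi(x)` for the features, `pi(x)` for the policy, `env_step(x, a)` for the environment, and `eta(t)` for the learning-step schedule:

```python
import numpy as np

def linear_td_lambda(phi, pi, env_step, eta, x0, d, gamma, lam, T):
    """Linear TD(lambda): follows the update equations above verbatim."""
    alpha = np.zeros(d)                    # parameter vector, initialized to zero
    z = np.zeros(d)                        # trace vector, initialized to zero
    x = x0
    for t in range(T):
        r, x_next = env_step(x, pi(x))
        # d_t = r(x_t, pi(x_t)) + gamma * V_alpha(x_{t+1}) - V_alpha(x_t)
        delta = r + gamma * phi(x_next) @ alpha - phi(x) @ alpha
        alpha = alpha + eta(t) * delta * z     # alpha_{t+1} = alpha_t + eta_t d_t z_t
        z = lam * gamma * z + phi(x_next)      # z_{t+1} = lambda*gamma*z_t + phi(x_{t+1})
        x = x_next
    return alpha
```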
Linear TD(λ): approximation error

Proposition (Tsitsiklis & Van Roy, 1996)

Let the learning rate $\eta_t$ satisfy $\sum_{t \ge 0} \eta_t = \infty$ and $\sum_{t \ge 0} \eta_t^2 < \infty$. We assume that $\pi$ admits a stationary distribution $\mu_\pi$ and that the features $(\phi_i)_{1 \le i \le d}$ are linearly independent. Then there exists a fixed $\alpha^*$ such that $\lim_{t \to \infty} \alpha_t = \alpha^*$. Furthermore,
$$\underbrace{\|V_{\alpha^*} - V^\pi\|_{2,\mu_\pi}}_{\text{approximation error}} \le \frac{1 - \lambda\gamma}{1 - \gamma} \underbrace{\inf_\alpha \|V_\alpha - V^\pi\|_{2,\mu_\pi}}_{\text{smallest approximation error}}.$$
Remark: for $\lambda = 1$ we recover Monte-Carlo (TD(1)), for which the factor $(1-\lambda\gamma)/(1-\gamma)$ equals 1, so the bound is the smallest!

Problem: the bound does not consider the variance (i.e., the number of samples needed for $\alpha_t$ to converge to $\alpha^*$), which typically grows as $\lambda$ approaches 1.
Linear TD(λ): implementation

Pros: simple to implement; computational cost linear in $d$.
Cons: very sample-inefficient; many samples are needed to converge.
Least-Squares Temporal Difference
Least-squares TD: the algorithm

Recall: $V^\pi = T^\pi V^\pi$. Intuition: compute $V = \mathcal{A} T^\pi V$.

Focus on the $L_{2,\mu}$-weighted norm and the corresponding projection $\Pi_\mu$:
$$\Pi_\mu g = \arg\min_{f \in F} \|f - g\|_\mu.$$
The LSTD solution is the fixed point $V_{TD} = \Pi_\mu T^\pi V_{TD}$.

[Figure: geometry of $V^\pi$, $T^\pi V_{TD}$, and the projection $\Pi_\mu$ onto $F$.]
By construction, the Bellman residual of $V_{TD}$ is orthogonal to $F$, thus for any $1 \le i \le d$:
$$\langle T^\pi V_{TD} - V_{TD}, \phi_i \rangle_\mu = 0,$$
and
$$\langle r^\pi + \gamma P^\pi V_{TD} - V_{TD}, \phi_i \rangle_\mu = 0$$
$$\langle r^\pi, \phi_i \rangle_\mu + \sum_{j=1}^d \langle \gamma P^\pi \phi_j - \phi_j, \phi_i \rangle_\mu\, \alpha_{TD,j} = 0,$$
so $\alpha_{TD}$ is the solution of a linear system of order $d$.
The LSTD solution $\alpha_{TD}$ can be computed by forming the matrix $A$ and vector $b$ defined as
$$A_{i,j} = \langle \phi_i, \phi_j - \gamma P^\pi \phi_j \rangle_\mu, \qquad b_i = \langle \phi_i, r^\pi \rangle_\mu,$$
and then solving the system $A\alpha = b$.
Least-squares TD: the approximation error

Problem: in general $\Pi_\mu T^\pi$ does not admit a fixed point (i.e., the matrix $A$ is not invertible).

Solution: use the stationary distribution $\mu_\pi$ of policy $\pi$, that is, $\mu_\pi P^\pi = \mu_\pi$, i.e., $\mu_\pi(y) = \sum_x p(y|x, \pi(x))\, \mu_\pi(x)$.

Proposition

The Bellman operator $T^\pi$ is a contraction in the weighted $L_{2,\mu_\pi}$-norm. Thus the joint operator $\Pi_{\mu_\pi} T^\pi$ is a contraction and admits a unique fixed point $V_{TD}$. Then
$$\underbrace{\|V^\pi - V_{TD}\|_{\mu_\pi}}_{\text{approximation error}} \le \frac{1}{\sqrt{1-\gamma^2}} \underbrace{\inf_{V \in F} \|V^\pi - V\|_{\mu_\pi}}_{\text{smallest approximation error}}.$$
Least-squares TD: the implementation

- Generate $(X_0, X_1, \ldots)$ from direct execution of $\pi$ and observe $R_t = r(X_t, \pi(X_t))$.
- Compute the estimates
$$\hat{A}_{i,j} = \frac{1}{n} \sum_{t=1}^n \phi_i(X_t) \big[ \phi_j(X_t) - \gamma \phi_j(X_{t+1}) \big], \qquad \hat{b}_i = \frac{1}{n} \sum_{t=1}^n \phi_i(X_t) R_t.$$
- Solve $\hat{A}\alpha = \hat{b}$.

Remarks: no need for a generative model; if the chain is ergodic, $\hat{A} \to A$ and $\hat{b} \to b$ as $n \to \infty$.
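A minimal sketch of these estimates from a single trajectory, again assuming a hypothetical feature map `phi(x)`:

```python
import numpy as np

def lstd(phi, states, rewards, gamma, d):
    """Empirical LSTD: build A_hat, b_hat from one trajectory and solve.

    states: trajectory (X_0, ..., X_n) under pi;  rewards[t] = r(X_t, pi(X_t)).
    """
    A_hat = np.zeros((d, d))
    b_hat = np.zeros(d)
    n = len(states) - 1
    for t in range(n):
        f_now, f_next = phi(states[t]), phi(states[t + 1])
        # A_ij += phi_i(X_t) [phi_j(X_t) - gamma * phi_j(X_{t+1})] / n
        A_hat += np.outer(f_now, f_now - gamma * f_next) / n
        b_hat += f_now * rewards[t] / n
    return np.linalg.solve(A_hat, b_hat)       # alpha_TD
```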
Bellman Residual Minimization
Bellman Residual Minimization (BRM): the idea

Let $\mu$ be a distribution over $X$. $V_{BR}$ is the function in $F$ with the minimum Bellman residual w.r.t. $T^\pi$:
$$V_{BR} = \arg\min_{V \in F} \|T^\pi V - V\|_{2,\mu}.$$

[Figure: geometry of $V^\pi$, $F$, and the Bellman residual $T^\pi V - V$.]
The mapping $\alpha \mapsto T^\pi V_\alpha - V_\alpha$ is affine, so the function $\alpha \mapsto \|T^\pi V_\alpha - V_\alpha\|^2_\mu$ is quadratic. The minimum is obtained by computing the gradient and setting it to zero:
$$\Big\langle r^\pi + (\gamma P^\pi - I) \sum_{j=1}^d \phi_j \alpha_j,\ (\gamma P^\pi - I)\phi_i \Big\rangle_\mu = 0,$$
which can be rewritten as $A\alpha = b$, with
$$A_{i,j} = \langle \phi_i - \gamma P^\pi \phi_i, \phi_j - \gamma P^\pi \phi_j \rangle_\mu, \qquad b_i = \langle \phi_i - \gamma P^\pi \phi_i, r^\pi \rangle_\mu.$$
Remark: the system admits a solution whenever the features $\phi_i$ are linearly independent w.r.t. $\mu$.

Remark: letting $\psi_i = \phi_i - \gamma P^\pi \phi_i$ for $i = 1, \ldots, d$, the previous system can be interpreted as a linear regression problem: minimize $\|\alpha^\top \psi - r^\pi\|_\mu$.
BRM: the approximation error

Proposition

We have
$$\|V^\pi - V_{BR}\| \le \|(I - \gamma P^\pi)^{-1}\| \, (1 + \gamma \|P^\pi\|) \inf_{V \in F} \|V^\pi - V\|.$$
If $\mu_\pi$ is the stationary distribution of $\pi$, then $\|P^\pi\|_{\mu_\pi} = 1$ and $\|(I - \gamma P^\pi)^{-1}\|_{\mu_\pi} = \frac{1}{1-\gamma}$; thus
$$\|V^\pi - V_{BR}\|_{\mu_\pi} \le \frac{1+\gamma}{1-\gamma} \inf_{V \in F} \|V^\pi - V\|_{\mu_\pi}.$$
BRM: the implementation

Assumption: a generative model is available.

- Draw $n$ states $X_t \sim \mu$.
- Call the generative model on $(X_t, A_t)$ (with $A_t = \pi(X_t)$) and obtain $R_t = r(X_t, A_t)$ and $Y_t \sim p(\cdot|X_t, A_t)$.
- Compute
$$\hat{B}(V) = \frac{1}{n} \sum_{t=1}^n \Big[ V(X_t) - \underbrace{\big( R_t + \gamma V(Y_t) \big)}_{\hat{T} V(X_t)} \Big]^2.$$
Problem: this estimator is biased and not consistent! In fact,
$$\mathbb{E}[\hat{B}(V)] = \mathbb{E}\Big[ \big( V(X_t) - T^\pi V(X_t) + T^\pi V(X_t) - \hat{T} V(X_t) \big)^2 \Big] = \|T^\pi V - V\|^2_\mu + \mathbb{E}\Big[ \big( T^\pi V(X_t) - \hat{T} V(X_t) \big)^2 \Big],$$
so minimizing $\hat{B}(V)$ does not correspond to minimizing $B(V)$ (even when $n \to \infty$).
Solution: in each state $X_t$, generate two independent samples $Y_t$ and $Y'_t \sim p(\cdot|X_t, A_t)$ and define
$$\hat{B}(V) = \frac{1}{n} \sum_{t=1}^n \big[ V(X_t) - \big( R_t + \gamma V(Y_t) \big) \big] \big[ V(X_t) - \big( R_t + \gamma V(Y'_t) \big) \big].$$
Then $\hat{B} \to B$ as $n \to \infty$.
The function $\alpha \mapsto \hat{B}(V_\alpha)$ is quadratic, and we obtain the linear system $\hat{A}\alpha = \hat{b}$ with
$$\hat{A}_{i,j} = \frac{1}{n} \sum_{t=1}^n \big[ \phi_i(X_t) - \gamma \phi_i(Y_t) \big]\big[ \phi_j(X_t) - \gamma \phi_j(Y'_t) \big], \qquad \hat{b}_i = \frac{1}{n} \sum_{t=1}^n \Big[ \phi_i(X_t) - \gamma \frac{\phi_i(Y_t) + \phi_i(Y'_t)}{2} \Big] R_t.$$
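A minimal sketch of the double-sample estimator, reusing the hypothetical generative-model stub `sample_model(x, a, rng)` from the fitted Q-iteration sketch (two independent calls per state give $Y_t$ and $Y'_t$):

```python
import numpy as np

def brm(phi, pi, sample_model, X, gamma, d, rng):
    """BRM with double sampling: build the linear system A_hat alpha = b_hat."""
    A_hat = np.zeros((d, d))
    b_hat = np.zeros(d)
    n = len(X)
    for x in X:
        a = pi(x)
        R, Y1 = sample_model(x, a, rng)        # first independent next state
        _, Y2 = sample_model(x, a, rng)        # second independent next state
        u = phi(x) - gamma * phi(Y1)
        v = phi(x) - gamma * phi(Y2)
        A_hat += np.outer(u, v) / n
        b_hat += (phi(x) - gamma * (phi(Y1) + phi(Y2)) / 2) * R / n
    return np.linalg.solve(A_hat, b_hat)       # alpha_BR
```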
LSTD vs BRM

- Different assumptions: BRM requires a generative model; LSTD requires only a single trajectory.
- The performance is evaluated differently: BRM under any distribution $\mu$; LSTD under the stationary distribution $\mu_\pi$.
Bibliography

Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific.
Tsitsiklis, J. N., & Van Roy, B. (1996). An analysis of temporal-difference learning with function approximation.
Reinforcement Learning
Alessandro Lazaric
alessandro.lazaric@inria.fr
sequel.lille.inria.fr