Approximate Dynamic Programming


Approximate Dynamic Programming
A. LAZARIC (SequeL Team @ INRIA-Lille)
Ecole Centrale - Option DAD, EC-RL Course

Value Iteration: the Idea

1. Let $V_0$ be any vector in $\mathbb{R}^N$.
2. At each iteration $k = 1, 2, \ldots, K$: compute $V_{k+1} = \mathcal{T} V_k$.
3. Return the greedy policy
$\pi_K(x) \in \arg\max_{a \in A} \big[ r(x,a) + \gamma \sum_y p(y|x,a) V_K(y) \big].$
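To make the loop concrete, here is a minimal tabular value-iteration sketch in Python/NumPy (not part of the slides; it assumes the model $p$ and $r$ are given as arrays):

import numpy as np

def value_iteration(P, r, gamma, K):
    # P[x, a, y] = p(y | x, a), shape (S, A, S); r[x, a] = r(x, a), shape (S, A)
    S, A = r.shape
    V = np.zeros(S)                              # V_0: any vector in R^N
    for _ in range(K):
        V = np.max(r + gamma * P @ V, axis=1)    # V_{k+1} = T V_k (Bellman optimality backup)
    pi = np.argmax(r + gamma * P @ V, axis=1)    # greedy policy w.r.t. V_K
    return V, pi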

Value Iteration: the Guarantees

From the fixed-point property of $\mathcal{T}$: $\lim_{k \to \infty} V_k = V^*$.
From the contraction property of $\mathcal{T}$:
$\|V_{k+1} - V^*\|_\infty \le \gamma^{k+1} \|V_0 - V^*\|_\infty.$
Problem: what if $V_{k+1} \neq \mathcal{T} V_k$?

Policy Iteration: the Idea

1. Let $\pi_0$ be any stationary policy.
2. At each iteration $k = 1, 2, \ldots, K$:
Policy evaluation: given $\pi_k$, compute $V_k = V^{\pi_k}$.
Policy improvement: compute the greedy policy
$\pi_{k+1}(x) \in \arg\max_{a \in A} \big[ r(x,a) + \gamma \sum_y p(y|x,a) V^{\pi_k}(y) \big].$
3. Return the last policy $\pi_K$.
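A corresponding tabular policy-iteration sketch (again assuming the model is known; the exact evaluation step solves the linear system $(I - \gamma P^\pi) V = r^\pi$):

import numpy as np

def policy_iteration(P, r, gamma, K):
    # P[x, a, y] = p(y | x, a), shape (S, A, S); r[x, a] = r(x, a), shape (S, A)
    S, A = r.shape
    pi = np.zeros(S, dtype=int)                  # pi_0: any stationary policy
    for _ in range(K):
        P_pi = P[np.arange(S), pi]               # transition matrix under pi, shape (S, S)
        r_pi = r[np.arange(S), pi]               # reward vector under pi, shape (S,)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # policy evaluation: V^pi
        pi = np.argmax(r + gamma * P @ V, axis=1)             # policy improvement: greedy w.r.t. V^pi
    return pi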

Policy Iteration: the Guarantees

The policy iteration algorithm generates a sequence of policies with non-decreasing performance, $V^{\pi_{k+1}} \ge V^{\pi_k}$, and it converges to $\pi^*$ in a finite number of iterations.
Problem: what if $V_k \neq V^{\pi_k}$?

Sources of Error

Approximation error: if $X$ is large or continuous, value functions $V$ cannot be represented exactly; use an approximation space $\mathcal{F}$.
Estimation error: if the reward $r$ and the dynamics $p$ are unknown, the Bellman operators $\mathcal{T}$ and $\mathcal{T}^\pi$ cannot be computed exactly; estimate the Bellman operators from samples.

Outline

Performance Loss
Approximate Value Iteration
Approximate Policy Iteration

From Approximation Error to Performance Loss

Question: if $V$ is an approximation of the optimal value function $V^*$ with an error
$\text{error} = \|V - V^*\|_\infty,$
how does it translate to the (loss of) performance of the greedy policy
$\pi(x) \in \arg\max_{a \in A} \sum_y p(y|x,a) \big[ r(x,a,y) + \gamma V(y) \big],$
i.e., what is the performance loss $\|V^* - V^\pi\|_\infty$?

Proposition. Let $V \in \mathbb{R}^N$ be an approximation of $V^*$ and $\pi$ its corresponding greedy policy. Then
$\underbrace{\|V^* - V^\pi\|_\infty}_{\text{performance loss}} \le \frac{2\gamma}{1-\gamma} \underbrace{\|V^* - V\|_\infty}_{\text{approx. error}}.$
Furthermore, there exists $\epsilon > 0$ such that if $\|V - V^*\|_\infty \le \epsilon$, then $\pi$ is optimal.
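As a rough worked instance of the bound (not from the slides): with $\gamma = 0.9$, an approximation error of $0.1$ may translate into a performance loss as large as
$\|V^* - V^\pi\|_\infty \le \frac{2 \cdot 0.9}{1 - 0.9} \cdot 0.1 = 1.8,$
i.e., the error is amplified by a factor $2\gamma/(1-\gamma) = 18$.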

Question: how do we compute $V$?
Problem: unlike in standard approximation scenarios (see supervised learning), we have only limited access to the target function $V^*$.
Objective: given an approximation space $\mathcal{F}$, compute an approximation $V$ which is as close as possible to the best approximation of $V^*$ in $\mathcal{F}$, i.e.
$V \approx \arg\inf_{f \in \mathcal{F}} \|V^* - f\|.$

Approximate Value Iteration

Approximate Value Iteration: the Idea

Let $\mathcal{A}$ be an approximation operator.
1. Let $V_0$ be any vector in $\mathbb{R}^N$.
2. At each iteration $k = 1, 2, \ldots, K$: compute $V_{k+1} = \mathcal{A} \mathcal{T} V_k$.
3. Return the greedy policy
$\pi_K(x) \in \arg\max_{a \in A} \big[ r(x,a) + \gamma \sum_y p(y|x,a) V_K(y) \big].$

Let $\mathcal{A} = \Pi_\infty$ be a projection operator in $L_\infty$-norm, which corresponds to
$V_{k+1} = \Pi_\infty \mathcal{T} V_k = \arg\inf_{V \in \mathcal{F}} \|\mathcal{T} V_k - V\|_\infty.$

Approximate Value Iteration: convergence

Proposition. The projection $\Pi_\infty$ is a non-expansion and the joint operator $\Pi_\infty \mathcal{T}$ is a contraction. Thus there exists a unique fixed point $\tilde V = \Pi_\infty \mathcal{T} \tilde V$, which guarantees the convergence of AVI.

Approximate Value Iteration: performance loss

Proposition (Bertsekas & Tsitsiklis, 1996). Let $V_K$ be the function returned by AVI after $K$ iterations and $\pi_K$ its corresponding greedy policy. Then
$\|V^* - V^{\pi_K}\|_\infty \le \frac{2\gamma}{(1-\gamma)^2} \underbrace{\max_{0 \le k < K} \|\mathcal{T} V_k - \mathcal{A}\mathcal{T} V_k\|_\infty}_{\text{worst approx. error}} + \frac{2\gamma^{K+1}}{1-\gamma} \underbrace{\|V^* - V_0\|_\infty}_{\text{initial error}}.$

Fitted Q-iteration with linear approximation

Assumption: access to a generative model: given a state $x$ and an action $a$, it returns the reward $r(x,a)$ and a next state $y \sim p(\cdot|x,a)$.
Idea: work with Q-functions and linear spaces.
$Q^*$ is the unique fixed point of $\mathcal{T}$ defined over $X \times A$ as
$\mathcal{T} Q(x,a) = \sum_y p(y|x,a) \big[ r(x,a,y) + \gamma \max_b Q(y,b) \big].$
$\mathcal{F}$ is a space defined by $d$ features $\phi_1, \ldots, \phi_d : X \times A \to \mathbb{R}$ as
$\mathcal{F} = \Big\{ Q_\alpha(x,a) = \sum_{j=1}^d \alpha_j \phi_j(x,a), \ \alpha \in \mathbb{R}^d \Big\}.$
At each iteration compute $Q_{k+1} = \Pi \mathcal{T} Q_k$.

At each iteration compute $Q_{k+1} = \Pi \mathcal{T} Q_k$.
Problems: the $\Pi$ operator cannot be computed efficiently; the Bellman operator $\mathcal{T}$ is often unknown.

Problem: the $\Pi$ operator cannot be computed efficiently.
Solution: let $\mu$ be a distribution over $X$ and use a projection in $L_{2,\mu}$-norm onto the space $\mathcal{F}$:
$Q_{k+1} = \arg\min_{Q \in \mathcal{F}} \|Q - \mathcal{T} Q_k\|_\mu^2.$

Problem: the Bellman operator $\mathcal{T}$ is often unknown.
1. Sample $n$ state-action pairs $(X_i, A_i)$ with $X_i \sim \mu$ and $A_i$ random.
2. Simulate $Y_i \sim p(\cdot|X_i, A_i)$ and $R_i = r(X_i, A_i, Y_i)$ with the generative model.
3. Estimate $\mathcal{T} Q_k(X_i, A_i)$ with $Z_i = R_i + \gamma \max_{a \in A} Q_k(Y_i, a)$ (unbiased: $\mathbb{E}[Z_i \mid X_i, A_i] = \mathcal{T} Q_k(X_i, A_i)$).

At each iteration $k$ compute $Q_{k+1}$ as
$Q_{k+1} = \arg\min_{Q_\alpha \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \big[ Q_\alpha(X_i, A_i) - Z_i \big]^2.$
Since $Q_\alpha$ is a linear function of $\alpha$, this is a simple quadratic minimization problem with a closed-form solution.
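A sketch of one possible implementation with linear features (the helpers sample_states, sample_model, and phi are assumptions introduced for illustration, not part of the slides):

import numpy as np

def fitted_q_iteration(sample_states, actions, sample_model, phi, d, n, gamma, K):
    # sample_states(n) -> n states drawn from mu (assumed helper)
    # sample_model(x, a) -> (R, Y): one call to the generative model (assumed helper)
    # phi(x, a) -> feature vector in R^d (assumed helper)
    alpha = np.zeros(d)                                    # Q_0 = 0
    rng = np.random.default_rng(0)
    for _ in range(K):
        X = sample_states(n)
        A = rng.choice(actions, size=n)
        Phi = np.array([phi(x, a) for x, a in zip(X, A)])  # design matrix, shape (n, d)
        Z = np.empty(n)
        for i, (x, a) in enumerate(zip(X, A)):
            R, Y = sample_model(x, a)
            Z[i] = R + gamma * max(phi(Y, b) @ alpha for b in actions)   # Z_i estimates T Q_k(X_i, A_i)
        # least-squares fit: alpha_{k+1} = argmin (1/n) sum_i [Q_alpha(X_i, A_i) - Z_i]^2
        alpha, *_ = np.linalg.lstsq(Phi, Z, rcond=None)
    return alpha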

Other implementations: K-nearest neighbour; regularized linear regression with $L_1$ or $L_2$ regularization; neural networks; support vector machines.

Example: the Optimal Replacement Problem

State: level of wear of an object (e.g., a car).
Action: {(R)eplace, (K)eep}.
Cost: $c(x, R) = C$; $c(x, K) = c(x)$, the maintenance cost plus extra costs.
Dynamics: $p(\cdot|x, R) \sim \mathrm{Exp}(\beta)$ with density $d(y) = \beta e^{-\beta y} \mathbb{1}\{y \ge 0\}$; $p(\cdot|x, K) \sim x + \mathrm{Exp}(\beta)$ with density $d(y - x)$.
Problem: minimize the discounted expected cost over an infinite horizon.

Optimal value function:
$V^*(x) = \min\Big\{ c(x) + \gamma \int_0^\infty d(y - x) V^*(y)\, dy, \; C + \gamma \int_0^\infty d(y) V^*(y)\, dy \Big\}.$
Optimal policy: the action that attains the minimum.

[Figure: management cost and optimal value function as a function of the wear, with the Keep (K) / Replace (R) regions.]

Linear approximation space: $\mathcal{F} := \Big\{ V_\alpha(x) = \sum_{k=1}^{20} \alpha_k \cos\big(k\pi \tfrac{x}{x_{\max}}\big) \Big\}.$
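For illustration, a sketch of the cosine feature map and of a generative model for this problem (the maintenance cost function c is an assumed input; it is not specified on the slides):

import numpy as np

def cosine_features(x, x_max, d=20):
    # phi_k(x) = cos(k * pi * x / x_max), k = 1, ..., d
    k = np.arange(1, d + 1)
    return np.cos(k * np.pi * x / x_max)

def replacement_model(x, action, c, C, beta, rng):
    # One step of the generative model: returns (cost, next wear level).
    wear = rng.exponential(1.0 / beta)        # Exp(beta) increment (NumPy scale = 1/beta)
    if action == "R":
        return C, wear                        # replace: pay C, wear restarts from 0
    return c(x), x + wear                     # keep: pay maintenance c(x), wear accumulates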

Collect $N$ samples $(x_n)_{1 \le n \le N}$ on a uniform grid.

[Figure: left, the target values computed as $\{\mathcal{T} V_0(x_n)\}_{1 \le n \le N}$; right, the approximation $V_1 \in \mathcal{F}$ of the target function $\mathcal{T} V_0$.]

[Figure: left, the target values computed as $\{\mathcal{T} V_1(x_n)\}_{1 \le n \le N}$; center, the approximation $V_2 \in \mathcal{F}$ of $\mathcal{T} V_1$; right, the approximation $V_n \in \mathcal{F}$ after $n$ iterations.]

Approximate Policy Iteration

Outline: Linear Temporal-Difference; Least-Squares Temporal Difference; Bellman Residual Minimization.

Approximate Policy Iteration: the Idea

Let $\mathcal{A}$ be an approximation operator.
Policy evaluation: given the current policy $\pi_k$, compute $V_k = \mathcal{A} V^{\pi_k}$.
Policy improvement: given the approximated value of the current policy, compute the greedy policy w.r.t. $V_k$ as
$\pi_{k+1}(x) \in \arg\max_{a \in A} \big[ r(x,a) + \gamma \sum_{y \in X} p(y|x,a) V_k(y) \big].$
Problem: the algorithm is no longer guaranteed to converge.

[Figure: $\|V^* - V^{\pi_k}\|$ as a function of $k$, with an asymptotic error level.]
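Schematically, the API loop alternates the two steps (a bare-bones sketch with assumed helper functions for the evaluation and improvement operators):

def approximate_policy_iteration(pi0, evaluate, improve, K):
    # evaluate(pi) -> V approximating V^pi (e.g., TD(lambda), LSTD, or BRM below)
    # improve(V)   -> greedy policy w.r.t. V (requires the model or Q-function estimates)
    pi = pi0
    for _ in range(K):
        V = evaluate(pi)      # approximate policy evaluation: V_k = A V^{pi_k}
        pi = improve(V)       # policy improvement: greedy w.r.t. V_k
    return pi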

Approximate Policy Iteration: performance loss

Proposition. The asymptotic performance of the policies $\pi_k$ generated by the API algorithm is related to the approximation error as
$\limsup_{k \to \infty} \underbrace{\|V^* - V^{\pi_k}\|_\infty}_{\text{performance loss}} \le \frac{2\gamma}{(1-\gamma)^2} \limsup_{k \to \infty} \underbrace{\|V_k - V^{\pi_k}\|_\infty}_{\text{approximation error}}.$

Linear Temporal-Difference

Linear TD(λ): the algorithm

Given a linear space $\mathcal{F} = \{ V_\alpha(x) = \sum_{i=1}^d \alpha_i \phi_i(x), \ \alpha \in \mathbb{R}^d \}$, a trace vector $z \in \mathbb{R}^d$ and a parameter vector $\alpha \in \mathbb{R}^d$ initialized to zero. Generate a sequence of states $(x_0, x_1, x_2, \ldots)$ according to $\pi$. At each step $t$, the temporal difference is
$d_t = r(x_t, \pi(x_t)) + \gamma V_{\alpha_t}(x_{t+1}) - V_{\alpha_t}(x_t)$
and the parameters are updated as
$\alpha_{t+1} = \alpha_t + \eta_t\, d_t\, z_t, \qquad z_{t+1} = \lambda\gamma\, z_t + \phi(x_{t+1}),$
where $\eta_t$ is the learning step.
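A sketch of the update loop (assuming a pre-generated trajectory and reward sequence; the trace is refreshed with $\phi(x_t)$ before the parameter update, a standard way of writing the same recursion):

import numpy as np

def linear_td_lambda(states, rewards, phi, d, gamma, lam, steps):
    # states:  x_0, ..., x_T generated by following pi
    # rewards: rewards[t] = r(x_t, pi(x_t)); phi(x) -> feature vector in R^d
    # steps:   learning rates eta_t
    alpha = np.zeros(d)
    z = np.zeros(d)                                        # eligibility trace
    for t in range(len(states) - 1):
        x, x_next = states[t], states[t + 1]
        z = lam * gamma * z + phi(x)                       # accumulate trace
        d_t = rewards[t] + gamma * phi(x_next) @ alpha - phi(x) @ alpha   # temporal difference
        alpha = alpha + steps[t] * d_t * z                 # parameter update
    return alpha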

Linear TD(λ): approximation error

Proposition (Tsitsiklis & Van Roy, 1996). Let the learning rate $\eta_t$ satisfy $\sum_{t \ge 0} \eta_t = \infty$ and $\sum_{t \ge 0} \eta_t^2 < \infty$. Assume that $\pi$ admits a stationary distribution $\mu^\pi$ and that the features $(\phi_i)_{1 \le i \le d}$ are linearly independent. Then there exists a fixed $\alpha^*$ such that $\lim_{t \to \infty} \alpha_t = \alpha^*$. Furthermore,
$\underbrace{\|V_{\alpha^*} - V^\pi\|_{2,\mu^\pi}}_{\text{approximation error}} \le \frac{1 - \lambda\gamma}{1 - \gamma} \underbrace{\inf_\alpha \|V_\alpha - V^\pi\|_{2,\mu^\pi}}_{\text{smallest approximation error}}.$

Remark: for $\lambda = 1$ we recover Monte-Carlo (TD(1)) and the bound is the smallest (the factor equals 1).
Problem: the bound does not account for the variance (i.e., the number of samples needed for $\alpha_t$ to converge to $\alpha^*$).

Linear TD(λ): implementation

Pros: simple to implement; computational cost linear in $d$.
Cons: very sample inefficient; many samples are needed to converge.

Least-Squares Temporal Difference

Least-squares TD: the algorithm

Recall: $V^\pi = \mathcal{T}^\pi V^\pi$. Intuition: compute $V = \mathcal{A} \mathcal{T}^\pi V$.

[Figure: $V^\pi$, the approximation space $\mathcal{F}$, and the LSTD fixed point $V_{TD} = \Pi_\mu \mathcal{T}^\pi V_{TD}$.]

Focus on the $L_{2,\mu}$-weighted norm and the projection $\Pi_\mu$: $\Pi_\mu g = \arg\min_{f \in \mathcal{F}} \|f - g\|_\mu$.

By construction, the Bellman residual of $V_{TD}$ is orthogonal to $\mathcal{F}$, thus for any $1 \le i \le d$
$\langle \mathcal{T}^\pi V_{TD} - V_{TD}, \phi_i \rangle_\mu = 0,$
and
$\langle r^\pi + \gamma P^\pi V_{TD} - V_{TD}, \phi_i \rangle_\mu = 0$
$\langle r^\pi, \phi_i \rangle_\mu + \sum_{j=1}^d \langle \gamma P^\pi \phi_j - \phi_j, \phi_i \rangle_\mu\, \alpha_{TD,j} = 0,$
so $\alpha_{TD}$ is the solution of a linear system of order $d$.

Algorithm. The LSTD solution $\alpha_{TD}$ is obtained by computing the matrix $A$ and the vector $b$ defined as
$A_{i,j} = \langle \phi_i, \phi_j - \gamma P^\pi \phi_j \rangle_\mu, \qquad b_i = \langle \phi_i, r^\pi \rangle_\mu,$
and then solving the system $A\alpha = b$.

Least-squares TD: the approximation error

Problem: in general $\Pi_\mu \mathcal{T}^\pi$ does not admit a fixed point (i.e., the matrix $A$ may not be invertible).
Solution: use the stationary distribution $\mu^\pi$ of policy $\pi$, that is $\mu^\pi P^\pi = \mu^\pi$:
$\mu^\pi(y) = \sum_x p(y|x, \pi(x))\, \mu^\pi(x).$

Proposition. The Bellman operator $\mathcal{T}^\pi$ is a contraction in the weighted $L_{2,\mu^\pi}$-norm. Thus the joint operator $\Pi_{\mu^\pi} \mathcal{T}^\pi$ is a contraction and it admits a unique fixed point $V_{TD}$. Then
$\underbrace{\|V^\pi - V_{TD}\|_{\mu^\pi}}_{\text{approximation error}} \le \frac{1}{\sqrt{1-\gamma^2}} \underbrace{\inf_{V \in \mathcal{F}} \|V^\pi - V\|_{\mu^\pi}}_{\text{smallest approximation error}}.$

Least-squares TD: the implementation

Generate $(X_0, X_1, \ldots)$ by direct execution of $\pi$ and observe $R_t = r(X_t, \pi(X_t))$.
Compute the estimates
$\hat A_{ij} = \frac{1}{n} \sum_{t=1}^n \phi_i(X_t)\big[\phi_j(X_t) - \gamma \phi_j(X_{t+1})\big], \qquad \hat b_i = \frac{1}{n} \sum_{t=1}^n \phi_i(X_t) R_t.$
Solve $\hat A \alpha = \hat b$.
Remark: no need for a generative model. If the chain is ergodic, $\hat A \to A$ and $\hat b \to b$ as $n \to \infty$.
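A sketch of the estimation step (single trajectory, assumed feature map phi):

import numpy as np

def lstd(states, rewards, phi, d, gamma):
    # states: X_0, ..., X_n from running pi; rewards[t] = r(X_t, pi(X_t))
    n = len(states) - 1
    A_hat = np.zeros((d, d))
    b_hat = np.zeros(d)
    for t in range(n):
        f, f_next = phi(states[t]), phi(states[t + 1])
        A_hat += np.outer(f, f - gamma * f_next) / n       # empirical A
        b_hat += f * rewards[t] / n                        # empirical b
    return np.linalg.solve(A_hat, b_hat)                   # alpha_TD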

Bellman Residual Minimization

Bellman Residual Minimization (BRM): the idea

[Figure: $V^\pi$, the space $\mathcal{F}$, and the Bellman-residual minimizer $V_{BR}$.]

Let $\mu$ be a distribution over $X$. $V_{BR}$ is the function in $\mathcal{F}$ with the minimum Bellman residual w.r.t. $\mathcal{T}^\pi$:
$V_{BR} = \arg\min_{V \in \mathcal{F}} \|\mathcal{T}^\pi V - V\|_{2,\mu}.$

The mapping $\alpha \mapsto \mathcal{T}^\pi V_\alpha - V_\alpha$ is affine, so the function $\alpha \mapsto \|\mathcal{T}^\pi V_\alpha - V_\alpha\|_\mu^2$ is quadratic. The minimum is obtained by computing the gradient and setting it to zero:
$\Big\langle r^\pi + (\gamma P^\pi - I) \sum_{j=1}^d \phi_j \alpha_j, \ (\gamma P^\pi - I)\phi_i \Big\rangle_\mu = 0,$
which can be rewritten as $A\alpha = b$ with
$A_{i,j} = \langle \phi_i - \gamma P^\pi \phi_i, \ \phi_j - \gamma P^\pi \phi_j \rangle_\mu, \qquad b_i = \langle \phi_i - \gamma P^\pi \phi_i, \ r^\pi \rangle_\mu.$

Remark: the system admits a solution whenever the features $\phi_i$ are linearly independent w.r.t. $\mu$.
Remark: let $\psi_i = \phi_i - \gamma P^\pi \phi_i$, $i = 1, \ldots, d$; then the previous system can be interpreted as the linear regression problem $\min_\alpha \|\psi^\top \alpha - r^\pi\|_\mu$.

BRM: the approximation error

Proposition. We have
$\|V^\pi - V_{BR}\| \le \|(I - \gamma P^\pi)^{-1}\| \,(1 + \gamma \|P^\pi\|) \inf_{V \in \mathcal{F}} \|V^\pi - V\|.$
If $\mu^\pi$ is the stationary distribution of $\pi$, then $\|P^\pi\|_{\mu^\pi} = 1$ and $\|(I - \gamma P^\pi)^{-1}\|_{\mu^\pi} \le \frac{1}{1-\gamma}$, thus
$\|V^\pi - V_{BR}\|_{\mu^\pi} \le \frac{1+\gamma}{1-\gamma} \inf_{V \in \mathcal{F}} \|V^\pi - V\|_{\mu^\pi}.$

BRM: the implementation

Assumption: a generative model is available.
Draw $n$ states $X_t \sim \mu$.
Call the generative model on $(X_t, A_t)$ (with $A_t = \pi(X_t)$) and obtain $R_t = r(X_t, A_t)$, $Y_t \sim p(\cdot|X_t, A_t)$.
Compute
$\hat B(V) = \frac{1}{n} \sum_{t=1}^n \Big[ V(X_t) - \underbrace{\big( R_t + \gamma V(Y_t) \big)}_{\hat{\mathcal{T}} V(X_t)} \Big]^2.$

Problem: this estimator is biased and not consistent! In fact,
$\mathbb{E}[\hat B(V)] = \mathbb{E}\Big[ \big( V(X_t) - \mathcal{T}^\pi V(X_t) + \mathcal{T}^\pi V(X_t) - \hat{\mathcal{T}} V(X_t) \big)^2 \Big] = \|\mathcal{T}^\pi V - V\|_\mu^2 + \mathbb{E}\Big[ \big( \mathcal{T}^\pi V(X_t) - \hat{\mathcal{T}} V(X_t) \big)^2 \Big],$
so minimizing $\hat B(V)$ does not correspond to minimizing $B(V)$ (even when $n \to \infty$).

Solution: in each state $X_t$, generate two independent samples $Y_t$ and $Y'_t \sim p(\cdot|X_t, A_t)$ and define
$\hat B(V) = \frac{1}{n} \sum_{t=1}^n \big[ V(X_t) - \big( R_t + \gamma V(Y_t) \big) \big]\big[ V(X_t) - \big( R_t + \gamma V(Y'_t) \big) \big].$
Then $\hat B \to B$ as $n \to \infty$.

The function $\alpha \mapsto \hat B(V_\alpha)$ is quadratic and we obtain the linear system $\hat A \alpha = \hat b$ with
$\hat A_{i,j} = \frac{1}{n} \sum_{t=1}^n \big[ \phi_i(X_t) - \gamma \phi_i(Y_t) \big]\big[ \phi_j(X_t) - \gamma \phi_j(Y'_t) \big], \qquad \hat b_i = \frac{1}{n} \sum_{t=1}^n \Big[ \phi_i(X_t) - \gamma \frac{\phi_i(Y_t) + \phi_i(Y'_t)}{2} \Big] R_t.$
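A sketch of the double-sample estimator (states drawn from $\mu$; for each state the generative model is called twice to obtain the independent next states $Y_t$ and $Y'_t$):

import numpy as np

def brm(states, rewards, next1, next2, phi, d, gamma):
    # states[t] = X_t ~ mu; rewards[t] = R_t; next1[t] = Y_t, next2[t] = Y'_t (independent)
    n = len(states)
    A_hat = np.zeros((d, d))
    b_hat = np.zeros(d)
    for t in range(n):
        f = phi(states[t])
        g1, g2 = phi(next1[t]), phi(next2[t])
        A_hat += np.outer(f - gamma * g1, f - gamma * g2) / n
        b_hat += (f - gamma * (g1 + g2) / 2) * rewards[t] / n
    return np.linalg.solve(A_hat, b_hat)                   # alpha_BR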

LSTD vs BRM

Different assumptions: BRM requires a generative model; LSTD only requires a single trajectory.
The performance is evaluated differently: BRM under an arbitrary distribution $\mu$; LSTD under the stationary distribution $\mu^\pi$.


Reinforcement Learning
Alessandro Lazaric
sequel.lille.inria.fr
