Approximate Dynamic Programming


1 Approximate Dynamic Programming. A. LAZARIC (SequeL Team, INRIA Lille). ENS Cachan - Master 2 MVA, MVA-RL Course

2 Approximate Dynamic Programming (a.k.a. Batch Reinforcement Learning)

3 Approximate Dynamic Programming (a.k.a. Batch Reinforcement Learning): Approximate Value Iteration, Approximate Policy Iteration

4-6 From DP to ADP Dynamic programming algorithms require an explicit definition of the transition probabilities p(· | x, a) and of the reward function r(x, a). This knowledge is often unavailable (e.g., wind intensity, human-computer interaction). Can we rely on samples?

7-9 From DP to ADP Dynamic programming algorithms require an exact representation of value functions and policies. This is often impossible since their shape is too complicated (e.g., large or continuous state space). Can we use approximations?

10 The Objective Find a policy π such that the performance loss ||V* − V^π|| is as small as possible.

11-13 From Approximation Error to Performance Loss Question: if Ṽ is an approximation of the optimal value function V* with an error = ||V* − Ṽ||, how does it translate to the (loss of) performance of the greedy policy

π(x) ∈ arg max_{a ∈ A} Σ_y p(y | x, a) [ r(x, a, y) + γ Ṽ(y) ],

i.e., the performance loss ||V* − V^π||?

14 From Approximation Error to Performance Loss Proposition. Let Ṽ ∈ R^N be an approximation of V* and π its corresponding greedy policy. Then

||V* − V^π||_∞ (performance loss) ≤ 2γ/(1 − γ) ||V* − Ṽ||_∞ (approximation error).

Furthermore, there exists ɛ > 0 such that if ||V* − Ṽ||_∞ ≤ ɛ, then π is optimal.

15 From Approximation Error to Performance Loss Proof. Using T^π Ṽ = T Ṽ (since π is greedy w.r.t. Ṽ) and V^π = T^π V^π:

||V* − V^π||_∞ ≤ ||T V* − T^π Ṽ||_∞ + ||T^π Ṽ − T^π V^π||_∞
             ≤ ||T V* − T Ṽ||_∞ + γ ||Ṽ − V^π||_∞
             ≤ γ ||V* − Ṽ||_∞ + γ ( ||Ṽ − V*||_∞ + ||V* − V^π||_∞ ),

and rearranging gives ||V* − V^π||_∞ ≤ 2γ/(1 − γ) ||V* − Ṽ||_∞.

16 Approximate Dynamic Programming (a.k.a. Batch Reinforcement Learning): Approximate Value Iteration, Approximate Policy Iteration

17-19 From Approximation Error to Performance Loss Question: how do we compute a good Ṽ? Problem: unlike in standard approximation scenarios (see supervised learning), we have only limited access to the target function, i.e., V*. Solution: value iteration tends to learn functions which are close to the optimal value function V*.

20-21 Value Iteration: the Idea
1. Let Q_0 be any action-value function
2. At each iteration k = 1, 2, ..., K compute
   Q_{k+1}(x, a) = T Q_k(x, a) = r(x, a) + γ Σ_y p(y | x, a) max_b Q_k(y, b)
3. Return the greedy policy
   π_K(x) ∈ arg max_{a ∈ A} Q_K(x, a).

Problem: how can we approximate T Q_k? Problem: if Q_{k+1} ≠ T Q_k, does (approximate) value iteration still work?
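For reference before introducing approximation, here is a minimal sketch of exact Q-value iteration for a small finite MDP, assuming the model is given explicitly as arrays P and r (these array names are illustrative, not part of the slides):

```python
import numpy as np

def q_value_iteration(P, r, gamma, K):
    """Tabular Q-value iteration.

    P: array of shape (S, A, S) with P[x, a, y] = p(y | x, a)
    r: array of shape (S, A) with r[x, a] = expected reward
    gamma: discount factor, K: number of iterations.
    Returns Q_K and its greedy policy.
    """
    S, A = r.shape
    Q = np.zeros((S, A))             # Q_0: any action-value function
    for _ in range(K):
        V = Q.max(axis=1)            # max_b Q_k(y, b) for every state y
        Q = r + gamma * P.dot(V)     # (T Q_k)(x, a)
    policy = Q.argmax(axis=1)        # greedy policy pi_K
    return Q, policy
```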

22-23 Linear Fitted Q-iteration: the Approximation Space Linear space (used to approximate action-value functions)

F = { f_α(x, a) = Σ_{j=1}^d α_j ϕ_j(x, a), α ∈ R^d }

with features ϕ_j : X × A → [0, L] and feature vector φ(x, a) = [ϕ_1(x, a), ..., ϕ_d(x, a)]^T.

24 Linear Fitted Q-iteration: the Samples Assumption: access to a generative model, that is, a black-box simulator sim() of the environment is available. Given (x, a), sim(x, a) = (y, r), with y ∼ p(· | x, a) and r = r(x, a).

25-34 Linear Fitted Q-iteration
Input: space F, number of iterations K, sampling distribution ρ, number of samples n
Initial function Q_0 ∈ F
For k = 1, ..., K
  1. Draw n samples (x_i, a_i) i.i.d. from ρ
  2. Sample x'_i ∼ p(· | x_i, a_i) and r_i = r(x_i, a_i)
  3. Compute y_i = r_i + γ max_a Q_{k−1}(x'_i, a)
  4. Build the training set { ((x_i, a_i), y_i) }_{i=1}^n
  5. Solve the least-squares problem
     f_{α̂_k} = arg min_{f_α ∈ F} (1/n) Σ_{i=1}^n ( f_α(x_i, a_i) − y_i )^2
  6. Return Q_k = f_{α̂_k} (truncation may be needed)
Return the greedy policy π_K(·) = arg max_a Q_K(·, a) (a code sketch follows below)
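A minimal runnable sketch of this loop, assuming a generative model sim(x, a) returning (next_state, reward), a feature map phi(x, a) returning a d-dimensional numpy array, and a sampler sample_rho() for the (x_i, a_i) pairs; all of these names are illustrative placeholders, not part of the original slides:

```python
import numpy as np

def linear_fqi(sim, phi, sample_rho, actions, d, gamma, K, n, v_max):
    """Linear Fitted Q-iteration (sketch). Returns the last weight vector alpha_K."""
    alpha = np.zeros(d)                                   # Q_0 = 0 in F

    def Q(x, a, w):
        # Truncate to [-V_max, V_max], as suggested on the slides.
        return float(np.clip(phi(x, a) @ w, -v_max, v_max))

    for _ in range(K):
        xs_as = [sample_rho() for _ in range(n)]          # (x_i, a_i) ~ rho
        Phi = np.array([phi(x, a) for x, a in xs_as])     # n x d design matrix
        y = np.empty(n)
        for i, (x, a) in enumerate(xs_as):
            x_next, r = sim(x, a)                         # one call to the generative model
            y[i] = r + gamma * max(Q(x_next, b, alpha) for b in actions)
        # Least-squares fit of the regression targets y_i (approximates T Q_{k-1})
        alpha, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return alpha
```

The greedy policy π_K is then obtained, state by state, by maximizing phi(x, a) @ alpha over the action set.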

35-36 Linear Fitted Q-iteration: Sampling
1. Draw n samples (x_i, a_i) i.i.d. from ρ
2. Sample x'_i ∼ p(· | x_i, a_i) and r_i = r(x_i, a_i)

In practice this can be done once, before running the algorithm. The sampling distribution ρ should cover the state-action space in all relevant regions. If it is not possible to choose ρ, a database of samples can be used.

37-38 Linear Fitted Q-iteration: The Training Set
3. Compute y_i = r_i + γ max_a Q_{k−1}(x'_i, a)
4. Build the training set { ((x_i, a_i), y_i) }_{i=1}^n

Each y_i is an unbiased sample of T Q_{k−1}(x_i, a_i), since

E[y_i | x_i, a_i] = E[ r_i + γ max_a Q_{k−1}(x'_i, a) ] = r(x_i, a_i) + γ E[ max_a Q_{k−1}(x'_i, a) ]
                 = r(x_i, a_i) + γ ∫_X max_a Q_{k−1}(y, a) p(dy | x_i, a_i) = T Q_{k−1}(x_i, a_i).

The problem reduces to standard regression. The training set must be recomputed at each iteration.

39-40 Linear Fitted Q-iteration: The Regression Problem
5. Solve the least-squares problem
   f_{α̂_k} = arg min_{f_α ∈ F} (1/n) Σ_{i=1}^n ( f_α(x_i, a_i) − y_i )^2
6. Return Q_k = f_{α̂_k} (truncation may be needed)

Thanks to the linear space we can solve it in closed form:
- Build the feature matrix Φ = [ φ(x_1, a_1)^T ; ... ; φ(x_n, a_n)^T ]
- Compute α̂_k = (Φ^T Φ)^{−1} Φ^T y (least-squares solution)
- Truncate to [−V_max, V_max] (with V_max = R_max / (1 − γ))
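A short numpy sketch of this closed-form step; Phi, y, and v_max are as in the sketch above, and the optional ridge term is a common practical safeguard rather than something stated on the slide:

```python
import numpy as np

def fit_q_weights(Phi, y, v_max, ridge=0.0):
    """Closed-form least squares: alpha = (Phi^T Phi)^{-1} Phi^T y, then truncation."""
    d = Phi.shape[1]
    A = Phi.T @ Phi + ridge * np.eye(d)            # Gram matrix (optionally regularized)
    alpha = np.linalg.solve(A, Phi.T @ y)          # least-squares solution
    q_hat = np.clip(Phi @ alpha, -v_max, v_max)    # truncated predictions on the sample
    return alpha, q_hat
```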

41 Sketch of the Analysis [Diagram: at each iteration, Q_k approximates T Q_{k−1} up to an error ɛ_k; these per-iteration errors accumulate over the K iterations into a final error between Q* and Q^{π_K}, the value of the greedy policy π_K.]

42-44 Theoretical Objectives
Objective: derive a bound on the performance (quadratic) loss w.r.t. a testing distribution µ: ||Q* − Q^{π_K}||_µ ≤ ???
Sub-Objective 1: derive an intermediate bound on the prediction error at any iteration k w.r.t. the sampling distribution ρ: ||T Q_{k−1} − Q_k||_ρ ≤ ???
Sub-Objective 2: analyze how the error at each iteration is propagated through iterations: ||Q* − Q^{π_K}||_µ ≤ propagation( ||T Q_{k−1} − Q_k||_ρ )

45-49 The Sources of Error
Desired solution at iteration k: the exact backup T Q_{k−1}
Best solution in F (w.r.t. the sampling distribution ρ): f_{α*_k} = arg inf_{f_α ∈ F} ||f_α − T Q_{k−1}||_ρ  (error due to the approximation space F)
Returned solution: Q_k = f_{α̂_k} = arg min_{f_α ∈ F} (1/n) Σ_{i=1}^n ( f_α(x_i, a_i) − y_i )^2  (additional error due to the (random) samples)

50 Per-Iteration Error Theorem. At each iteration k, Linear-FQI returns an approximation Q_k such that (Sub-Objective 1)

||T Q_{k−1} − Q_k||_ρ ≤ 4 ||T Q_{k−1} − f_{α*_k}||_ρ + O( (V_max + L ||α*_k||) sqrt( log(1/δ) / n ) ) + O( V_max sqrt( d log(n/δ) / n ) )

with probability 1 − δ. Tools: concentration-of-measure inequalities, covering numbers, linear algebra, union bounds, special tricks for linear spaces, ...

51-54 Per-Iteration Error: Remarks
- 4 ||T Q_{k−1} − f_{α*_k}||_ρ : approximation error; no algorithm can do better (up to the constant 4); it depends on the space F and changes with the iteration k.
- O( (V_max + L ||α*_k||) sqrt( log(1/δ) / n ) ) : vanishes as O(n^{−1/2}); depends on the features (L) and on the best solution (||α*_k||).
- O( V_max sqrt( d log(n/δ) / n ) ) : vanishes as O(n^{−1/2}); depends on the dimensionality of the space (d) and the number of samples (n).

55-58 Error Propagation
Objective: ||Q* − Q^{π_K}||_µ
Problem 1: the test norm µ is different from the sampling norm ρ.
Problem 2: we have bounds for Q_k, not for the performance of the corresponding policy π_k.
Problem 3: we have bounds for one single iteration.

59-60 Error Propagation Transition kernel for a fixed policy: P^π. m-step (worst-case) concentration of the future state distribution:

c(m) = sup_{π_1, ..., π_m} || d(µ P^{π_1} ... P^{π_m}) / dρ ||_∞ < ∞

Average (discounted) concentration:

C_{µ,ρ} = (1 − γ)^2 Σ_{m ≥ 1} m γ^{m−1} c(m) < +∞

61 Error Propagation Remark: relationship to the top-Lyapunov exponent

L^+ = sup_π lim sup_{m → ∞} (1/m) log^+ ( || ρ P^{π_1} P^{π_2} ... P^{π_m} || ).

If L^+ ≤ 0 (stable system), then c(m) has a polynomial growth rate and C_{µ,ρ} is finite.

62 Error Propagation Proposition. Let ɛ_k = T Q_{k−1} − Q_k be the error made at each iteration. Then after K iterations the performance loss of the greedy policy π_K satisfies

||Q* − Q^{π_K}||_µ^2 ≤ [ 2γ / (1 − γ)^2 ]^2 C_{µ,ρ} max_k ||ɛ_k||_ρ^2 + O( γ^K / (1 − γ)^3 V_max^2 )

63-64 The Final Bound Bringing everything together: plug the per-iteration bound

||ɛ_k||_ρ = ||T Q_{k−1} − Q_k||_ρ ≤ 4 ||T Q_{k−1} − f_{α*_k}||_ρ + O( (V_max + L ||α*_k||) sqrt( log(1/δ) / n ) ) + O( V_max sqrt( d log(n/δ) / n ) )

into the propagation inequality above.

65 The Final Bound Theorem (see e.g., Munos, 03). LinearFQI with a space F of d features and n samples at each iteration returns a policy π_K after K iterations such that

||Q* − Q^{π_K}||_µ ≤ 2γ / (1 − γ)^2 sqrt(C_{µ,ρ}) [ 4 d(F, T F) + O( V_max (1 + L / sqrt(ω)) sqrt( d log(n/δ) / n ) ) ] + O( γ^K / (1 − γ)^3 V_max^2 )

66-67, 69-72 The Final Bound: Remarks
- The propagation (and the change of norms) makes the problem more complex: how do we choose the sampling distribution ρ?
- The approximation error d(F, T F) is worse than in regression.
- The dependency on γ is worse than at each single iteration: is it possible to avoid it?
- The error decreases exponentially in K: roughly K ≈ log(1/ɛ)/(1 − γ) iterations suffice to bring the γ^K term below ɛ.
- ω is the smallest eigenvalue of the Gram matrix: design the features so as to be (close to) orthogonal w.r.t. ρ.
- The asymptotic rate O(sqrt(d/n)) is the same as for regression.

68 The Final Bound: the inherent Bellman error

||T Q_{k−1} − f_{α*_k}||_ρ = inf_{f ∈ F} ||T Q_{k−1} − f||_ρ ≤ sup_{g ∈ F} inf_{f ∈ F} ||T g − f||_ρ = d(F, T F),

since Q_{k−1} ∈ F (up to truncation). Question: how do we design F to make it compatible with the Bellman operator T?

73 Summary: what the final bound on ||Q* − Q^{π_K}||_µ depends on
- Approximation algorithm and dynamic programming scheme: the per-iteration error and its propagation
- Samples: sampling strategy, number n, sampling distribution ρ
- Markov decision process: concentrability C_{µ,ρ}, range V_max
- Approximation space: inherent Bellman error d(F, T F), size d, features (L, ω)

74 Other Implementations Replace the regression step with: K-nearest neighbours, regularized linear regression with ℓ1 or ℓ2 penalties, neural networks, support vector regression, ...

75-79 Example: the Optimal Replacement Problem
State: level of wear of an object (e.g., a car).
Action: {(R)eplace, (K)eep}.
Cost: c(x, R) = C (replacement cost); c(x, K) = c(x), the maintenance plus extra costs.
Dynamics: p(· | x, R) = exp(β), i.e., the next state has density d(y) = β exp(−βy) I{y ≥ 0}; p(· | x, K) = x + exp(β), i.e., the next state has density d(y − x) for y ≥ x.
Problem: minimize the discounted expected cost over an infinite horizon.
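A small sketch of what a generative model sim(x, a) for this problem could look like under the dynamics and costs above; the maintenance-cost function and the numeric constants are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
BETA, REPLACE_COST = 0.6, 30.0          # illustrative parameters

def maintenance_cost(x):                # placeholder for c(x): grows with the wear level
    return 4.0 * x

def sim(x, action):
    """Generative model: returns (next_state, cost) for the replacement problem."""
    if action == "R":                   # replace: wear resets, next state ~ exp(beta)
        return rng.exponential(1.0 / BETA), REPLACE_COST
    else:                               # keep: wear accumulates, next state = x + exp(beta)
        return x + rng.exponential(1.0 / BETA), maintenance_cost(x)
```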

80-81 Example: the Optimal Replacement Problem Optimal value function

V*(x) = min{ c(x) + γ ∫_0^∞ d(y − x) V*(y) dy,  C + γ ∫_0^∞ d(y) V*(y) dy }

Optimal policy: the action that attains the minimum.

82-83 Example: the Optimal Replacement Problem [Figures: management cost c(x) and optimal value function V*(x) as functions of the wear x, with the (K)eep / (R)eplace regions of the optimal policy.] Linear approximation space F := { V_α(x) = Σ_{k=1}^{20} α_k cos(kπ x / x_max) }.
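A minimal sketch of this cosine feature map, with x_max standing for the assumed maximum wear level:

```python
import numpy as np

def cosine_features(x, d=20, x_max=10.0):
    """Feature vector [cos(k*pi*x/x_max)] for k = 1..d, spanning the linear space F."""
    k = np.arange(1, d + 1)
    return np.cos(k * np.pi * x / x_max)

# A value-function approximation in F is then V_alpha(x) = cosine_features(x) @ alpha.
```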

84 Example: the Optimal Replacement Problem Collect N samples on a uniform grid.

85 Example: the Optimal Replacement Problem Figure: Left: the target values computed as {T V_0(x_n)}_{1 ≤ n ≤ N}. Right: the approximation V_1 ∈ F of the target function T V_0.

86 Example: the Optimal Replacement Problem Figure: Left: the target values computed as {T V_1(x_n)}_{1 ≤ n ≤ N}. Center: the approximation V_2 ∈ F of T V_1. Right: the approximation V_n ∈ F after n iterations.

87 Example: the Optimal Replacement Problem Simulation

88 Approximate Dynamic Programming (a.k.a. Batch Reinforcement Learning): Approximate Value Iteration, Approximate Policy Iteration

89-90 Policy Iteration: the Idea
1. Let π_0 be any stationary policy
2. At each iteration k = 1, 2, ..., K
   Policy evaluation: given π_k, compute V_k = V^{π_k}.
   Policy improvement: compute the greedy policy
   π_{k+1}(x) ∈ arg max_{a ∈ A} [ r(x, a) + γ Σ_y p(y | x, a) V^{π_k}(y) ].
3. Return the last policy π_K

Problem: how can we approximate V^{π_k}? Problem: if V_k ≠ V^{π_k}, does (approximate) policy iteration still work?

91 Approximate Policy Iteration: Performance Loss Problem: the algorithm is no longer guaranteed to converge. [Figure: ||V* − V^{π_k}|| oscillates within an asymptotic error band as k grows.]

Proposition. The asymptotic performance of the policies π_k generated by the API algorithm is related to the approximation error as:

lim sup_{k → ∞} ||V* − V^{π_k}||_∞ (performance loss) ≤ 2γ / (1 − γ)^2 lim sup_{k → ∞} ||V_k − V^{π_k}||_∞ (approximation error)

92-93 Least-Squares Policy Iteration (LSPI) LSPI uses
- a linear space to approximate value functions*: F = { f(x) = Σ_{j=1}^d α_j ϕ_j(x), α ∈ R^d }
- the Least-Squares Temporal Difference (LSTD) algorithm for policy evaluation.
*In practice we use approximations of action-value functions.

94 Least-Squares Temporal-Difference Learning (LSTD) V^π may not belong to F (V^π ∉ F). The best approximation of V^π in F is its projection ΠV^π = arg min_{f ∈ F} ||V^π − f|| (Π is the projection onto F). [Figure: V^π outside F and its projection ΠV^π onto F.]

95 Least-Squares Temporal-Difference Learning (LSTD) V^π is the fixed point of T^π: V^π = T^π V^π = r^π + γ P^π V^π. LSTD searches for the fixed point of Π_{2,ρ} T^π, where Π_{2,ρ} g = arg min_{f ∈ F} ||g − f||_{2,ρ}. When the fixed point of Π_ρ T^π exists, we call it the LSTD solution:

V_TD = Π_ρ T^π V_TD

[Figure: V^π, the Bellman image T^π V_TD, and the LSTD fixed point V_TD = Π_ρ T^π V_TD inside F.]

96-98 Least-Squares Temporal-Difference Learning (LSTD) V_TD = Π_ρ T^π V_TD. The projection Π_ρ is orthogonal in expectation w.r.t. the space F spanned by the features ϕ_1, ..., ϕ_d:

E_{x∼ρ}[ (T^π V_TD(x) − V_TD(x)) ϕ_i(x) ] = 0, for all i ∈ [1, d]   ⟺   ⟨T^π V_TD − V_TD, ϕ_i⟩_ρ = 0

By definition of the Bellman operator,

⟨r^π + γ P^π V_TD − V_TD, ϕ_i⟩_ρ = 0   ⟺   ⟨r^π, ϕ_i⟩_ρ − ⟨(I − γ P^π) V_TD, ϕ_i⟩_ρ = 0

Since V_TD ∈ F, there exists α_TD such that V_TD(x) = φ(x)^T α_TD, hence

⟨r^π, ϕ_i⟩_ρ − Σ_{j=1}^d ⟨(I − γ P^π) ϕ_j, ϕ_i⟩_ρ α_{TD,j} = 0

99 Least-Squares Temporal-Difference Learning (LSTD) V_TD = Π_ρ T^π V_TD. Writing b_i = ⟨r^π, ϕ_i⟩_ρ and A_{i,j} = ⟨(I − γ P^π) ϕ_j, ϕ_i⟩_ρ, the orthogonality conditions become the linear system

A α_TD = b

100-101 Least-Squares Temporal-Difference Learning (LSTD)
Problem: in general, Π_ρ T^π is not a contraction and does not have a fixed point. Solution: if ρ = ρ^π (the stationary distribution of π), then Π_{ρ^π} T^π has a unique fixed point.
Problem: in general, Π_ρ T^π cannot be computed (because P^π and r^π are unknown). Solution: use samples coming from a trajectory of π.

102-108 Least-Squares Policy Iteration (LSPI)
Input: space F, number of iterations K, sampling distribution ρ, number of samples n
Initial policy π_0
For k = 1, ..., K
  1. Generate a trajectory of length n from the stationary distribution ρ^{π_k}:
     (x_1, π_k(x_1), r_1, x_2, π_k(x_2), r_2, ..., x_{n−1}, π_k(x_{n−1}), r_{n−1}, x_n)
  2. Compute the empirical matrix Â_k and the vector b̂_k:
     [Â_k]_{i,j} = (1/n) Σ_{t=1}^{n−1} ϕ_i(x_t) ( ϕ_j(x_t) − γ ϕ_j(x_{t+1}) )  ≈ ⟨(I − γ P^{π_k}) ϕ_j, ϕ_i⟩_{ρ^{π_k}}
     [b̂_k]_i = (1/n) Σ_{t=1}^{n−1} ϕ_i(x_t) r_t  ≈ ⟨r^{π_k}, ϕ_i⟩_{ρ^{π_k}}
  3. Solve the linear system α_k = Â_k^{−1} b̂_k
  4. Compute the greedy policy π_{k+1} w.r.t. V_k = f_{α_k}
Return the last policy π_K (see the LSTD sketch below)
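A minimal numpy sketch of the LSTD step inside this loop (steps 2 and 3), assuming a trajectory given as lists of states and rewards plus a feature map phi; the names are illustrative:

```python
import numpy as np

def lstd(states, rewards, phi, gamma):
    """Empirical LSTD: build A_hat, b_hat from one trajectory and solve A alpha = b.

    states: [x_1, ..., x_n], rewards: [r_1, ..., r_{n-1}], phi: x -> R^d feature map.
    """
    n = len(states)
    d = phi(states[0]).shape[0]
    A_hat = np.zeros((d, d))
    b_hat = np.zeros(d)
    for t in range(n - 1):
        f_t, f_next = phi(states[t]), phi(states[t + 1])
        A_hat += np.outer(f_t, f_t - gamma * f_next)   # estimates <(I - gamma P^pi) phi_j, phi_i>
        b_hat += f_t * rewards[t]                      # estimates <r^pi, phi_i>
    A_hat /= n
    b_hat /= n
    return np.linalg.solve(A_hat, b_hat)               # LSTD weights alpha
```

In LSPI the same construction is applied to state-action features (LSTD-Q), so that the greedy policy π_{k+1}(x) = arg max_a φ(x, a)^T α_k can be computed without knowing the model.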

109 Least-Squares Policy Iteration (LSPI) Remarks on step 1 (generate a trajectory of length n from the stationary distribution ρ^{π_k}): the first few samples may be discarded because they are not actually drawn from the stationary distribution ρ^{π_k}; off-policy samples could be used with importance weighting; in practice, i.i.d. states drawn from an arbitrary distribution (but with actions chosen by π_k) may be used.

110 Least-Squares Policy Iteration (LSPI) Remark on step 4 (compute the greedy policy π_{k+1} w.r.t. V_k = f_{α_k}): computing the greedy policy from V_k is difficult, so we move to LSTD-Q on action-value functions and compute π_{k+1}(x) = arg max_a Q_k(x, a).

111-112 Least-Squares Policy Iteration (LSPI) For k = 1, ..., K, the trajectory in step 1 is generated by π_k and the greedy policy π_{k+1} is computed from it (step 4). Problem: this process may be unstable because π_k does not cover the state space properly.

113-115 LSTD Algorithm When n → ∞, Â → A and b̂ → b, and thus α̂_TD → α_TD and V̂_TD → V_TD.

Proposition (LSTD Performance). If LSTD is used to estimate the value of π with an infinite number of samples drawn from the stationary distribution ρ^π, then

||V^π − V_TD||_{ρ^π} ≤ 1 / sqrt(1 − γ^2) inf_{V ∈ F} ||V^π − V||_{ρ^π}

Problem: we do not have an infinite number of samples. Problem 2: V_TD is a fixed-point solution, not the solution of a standard machine learning (regression) problem.

116-117 LSTD Error Bound Assumption: the Markov chain induced by the policy π_k has a stationary distribution ρ^{π_k} and it is ergodic and β-mixing.

Theorem (LSTD Error Bound). At any iteration k, if LSTD uses n samples obtained from a single trajectory of π_k and a d-dimensional space, then with probability 1 − δ

||V^{π_k} − V_k||_{ρ^{π_k}} ≤ c / sqrt(1 − γ^2) inf_{f ∈ F} ||V^{π_k} − f||_{ρ^{π_k}} + O( sqrt( d log(d/δ) / (n ν) ) )

118-120 LSTD Error Bound

||V^{π_k} − V_k||_{ρ^{π_k}} ≤ c / sqrt(1 − γ^2) inf_{f ∈ F} ||V^{π_k} − f||_{ρ^{π_k}} (approximation error) + O( sqrt( d log(d/δ) / (n ν_k) ) ) (estimation error)

- Approximation error: it depends on how well the function space F can approximate the value function V^{π_k}.
- Estimation error: it depends on the number of samples n, the dimensionality of the function space d, the smallest eigenvalue ν_k of the Gram matrix ( ∫ ϕ_i ϕ_j dρ^{π_k} )_{i,j}, and the mixing properties of the Markov chain (hidden in the O(·)).
- Assumption: the eigenvalues of the Gram matrix are strictly positive (existence of the model-based LSTD solution). The β-mixing coefficients are hidden in the O(·) notation.

121-124 LSPI Error Bound Theorem (LSPI Error Bound). If LSPI is run over K iterations, then the performance loss of the policy π_K is

||V* − V^{π_K}||_µ ≤ 4γ / (1 − γ)^2 { sqrt(C C_{µ,ρ}) [ c E_0(F) + O( sqrt( d log(dK/δ) / (n ν_ρ) ) ) ] + γ^K R_max }

with probability 1 − δ.
- Approximation error: E_0(F) = sup_{π ∈ G(F̃)} inf_{f ∈ F} ||V^π − f||_{ρ^π}
- Estimation error: depends on n, d, ν_ρ, K
- Initialization error: the γ^K R_max term accounts for the error ||V* − V^{π_0}|| due to the choice of the initial value function or initial policy.

125-126 LSPI Error Bound Lower-Bounding Distribution: there exists a distribution ρ such that for any policy π ∈ G(F̃), we have ρ ≤ C ρ^π, where C < ∞ is a constant and ρ^π is the stationary distribution of π. Furthermore, we can define the concentrability coefficient C_{µ,ρ} as before. ν_ρ is the smallest eigenvalue of the Gram matrix ( ∫ ϕ_i ϕ_j dρ )_{i,j}.

127 Bellman Residual Minimization (BRM): the Idea [Figure: V^π, the space F, and V_BR, the element of F whose Bellman image T^π V_BR is closest to V_BR itself.] Let µ be a distribution over X; V_BR is the minimizer of the Bellman residual w.r.t. T^π:

V_BR = arg min_{V ∈ F} ||T^π V − V||_{2,µ}

128 Bellman Residual Minimization (BRM): the Idea The mapping α ↦ T^π V_α − V_α is affine, so the function α ↦ ||T^π V_α − V_α||_µ^2 is quadratic. The minimum is obtained by computing the gradient and setting it to zero:

⟨ r^π + (γ P^π − I) Σ_{j=1}^d φ_j α_j, (γ P^π − I) φ_i ⟩_µ = 0,

which can be rewritten as A α = b, with A_{i,j} = ⟨ φ_i − γ P^π φ_i, φ_j − γ P^π φ_j ⟩_µ and b_i = ⟨ φ_i − γ P^π φ_i, r^π ⟩_µ.

129-130 Bellman Residual Minimization (BRM): the Idea Remark: the system admits a solution whenever the features φ_i are linearly independent w.r.t. µ. Remark: letting ψ_i = φ_i − γ P^π φ_i for i = 1, ..., d, the previous system can be interpreted as a linear regression problem: find α such that α^T ψ ≈ r^π in ||·||_µ.

131 BRM: the Approximation Error Proposition. We have

||V^π − V_BR|| ≤ ||(I − γ P^π)^{−1}|| (1 + γ ||P^π||) inf_{V ∈ F} ||V^π − V||.

If µ^π is the stationary distribution of π, then ||P^π||_{µ^π} = 1 and ||(I − γ P^π)^{−1}||_{µ^π} ≤ 1/(1 − γ), thus

||V^π − V_BR||_{µ^π} ≤ (1 + γ)/(1 − γ) inf_{V ∈ F} ||V^π − V||_{µ^π}.

132 BRM: the Implementation Assumption: a generative model is available. Draw n states X_t ∼ µ, call the generative model on (X_t, A_t) (with A_t = π(X_t)) and obtain R_t = r(X_t, A_t) and Y_t ∼ p(· | X_t, A_t). Compute the empirical Bellman residual

B̂(V) = (1/n) Σ_{t=1}^n [ V(X_t) − ( R_t + γ V(Y_t) ) ]^2,

where R_t + γ V(Y_t) is the sampled backup T̂ V(X_t).

133 BRM: the Implementation Problem: this estimator is biased and not consistent! In fact,

E[B̂(V)] = E[ ( V(X_t) − T^π V(X_t) + T^π V(X_t) − T̂ V(X_t) )^2 ] = ||T^π V − V||_µ^2 + E[ ( T^π V(X_t) − T̂ V(X_t) )^2 ],

so minimizing B̂(V) does not correspond to minimizing B(V) (even when n → ∞), because of the extra variance term.

134 BRM: the Implementation Solution: in each state X_t, generate two independent next-state samples Y_t and Y'_t ∼ p(· | X_t, A_t) and define

B̂(V) = (1/n) Σ_{t=1}^n [ V(X_t) − ( R_t + γ V(Y_t) ) ] [ V(X_t) − ( R_t + γ V(Y'_t) ) ].

Then B̂ → B as n → ∞.

135 BRM: the Implementation The function α ↦ B̂(V_α) is quadratic and we obtain the linear system

Â_{i,j} = (1/n) Σ_{t=1}^n [ φ_i(X_t) − γ φ_i(Y_t) ] [ φ_j(X_t) − γ φ_j(Y'_t) ],
b̂_i = (1/n) Σ_{t=1}^n [ φ_i(X_t) − γ ( φ_i(Y_t) + φ_i(Y'_t) ) / 2 ] R_t.
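A small numpy sketch of this double-sample construction, assuming arrays X (states), R (rewards), and two independent next-state arrays Y and Y2, plus a feature map phi; the names are illustrative:

```python
import numpy as np

def brm_system(X, R, Y, Y2, phi, gamma):
    """Build the BRM linear system A alpha = b from double next-state samples and solve it."""
    F_x = np.array([phi(x) for x in X])      # phi(X_t), shape (n, d)
    F_y = np.array([phi(y) for y in Y])      # phi(Y_t)
    F_y2 = np.array([phi(y) for y in Y2])    # phi(Y'_t), independent copy
    n = len(X)
    Psi1 = F_x - gamma * F_y                 # phi_i(X_t) - gamma phi_i(Y_t)
    Psi2 = F_x - gamma * F_y2                # phi_j(X_t) - gamma phi_j(Y'_t)
    A_hat = Psi1.T @ Psi2 / n                # estimate of A_{i,j}
    b_hat = (F_x - gamma * (F_y + F_y2) / 2).T @ np.asarray(R) / n
    return np.linalg.solve(A_hat, b_hat)     # alpha_BR
```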

136-137 BRM: the Approximation Error Proof. We relate the Bellman residual to the approximation error:

V^π − V = V^π − T^π V + T^π V − V = γ P^π (V^π − V) + T^π V − V   ⟹   (I − γ P^π)(V^π − V) = T^π V − V.

Taking norms on both sides, ||V^π − V_BR|| ≤ ||(I − γ P^π)^{−1}|| ||T^π V_BR − V_BR||, and

||T^π V_BR − V_BR|| = inf_{V ∈ F} ||T^π V − V|| ≤ (1 + γ ||P^π||) inf_{V ∈ F} ||V^π − V||.

If we consider the stationary distribution µ^π, then ||P^π||_{µ^π} = 1 and (I − γ P^π)^{−1} can be written as the power series Σ_t γ^t (P^π)^t; applying the norm we obtain

||(I − γ P^π)^{−1}||_{µ^π} ≤ Σ_{t ≥ 0} γ^t ||P^π||^t_{µ^π} = 1/(1 − γ).

138 LSTD vs BRM Different assumptions: BRM requires a generative model (two independent next states per sampled state), LSTD only a single trajectory. The performance is evaluated differently: BRM w.r.t. any distribution µ, LSTD w.r.t. the stationary distribution of π.

139 Bibliography I

140 Reinforcement Learning Alessandro Lazaric sequel.lille.inria.fr


More information

Linear Least-squares Dyna-style Planning

Linear Least-squares Dyna-style Planning Linear Least-squares Dyna-style Planning Hengshuai Yao Department of Computing Science University of Alberta Edmonton, AB, Canada T6G2E8 hengshua@cs.ualberta.ca Abstract World model is very important for

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Based on [Kaelbling et al., 1996, Bertsekas, 2000] Bert Kappen Reinforcement learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error

More information

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)

More information

Confidence intervals for the mixing time of a reversible Markov chain from a single sample path

Confidence intervals for the mixing time of a reversible Markov chain from a single sample path Confidence intervals for the mixing time of a reversible Markov chain from a single sample path Daniel Hsu Aryeh Kontorovich Csaba Szepesvári Columbia University, Ben-Gurion University, University of Alberta

More information

An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs

An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs 2015 IEEE 54th Annual Conference on Decision and Control CDC December 15-18, 2015. Osaka, Japan An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs Abhishek Gupta Rahul Jain Peter

More information

Reinforcement Learning with Function Approximation. Joseph Christian G. Noel

Reinforcement Learning with Function Approximation. Joseph Christian G. Noel Reinforcement Learning with Function Approximation Joseph Christian G. Noel November 2011 Abstract Reinforcement learning (RL) is a key problem in the field of Artificial Intelligence. The main goal is

More information

Reinforcement Learning. Introduction

Reinforcement Learning. Introduction Reinforcement Learning Introduction Reinforcement Learning Agent interacts and learns from a stochastic environment Science of sequential decision making Many faces of reinforcement learning Optimal control

More information

Central-limit approach to risk-aware Markov decision processes

Central-limit approach to risk-aware Markov decision processes Central-limit approach to risk-aware Markov decision processes Jia Yuan Yu Concordia University November 27, 2015 Joint work with Pengqian Yu and Huan Xu. Inventory Management 1 1 Look at current inventory

More information

Kernel Methods. Foundations of Data Analysis. Torsten Möller. Möller/Mori 1

Kernel Methods. Foundations of Data Analysis. Torsten Möller. Möller/Mori 1 Kernel Methods Foundations of Data Analysis Torsten Möller Möller/Mori 1 Reading Chapter 6 of Pattern Recognition and Machine Learning by Bishop Chapter 12 of The Elements of Statistical Learning by Hastie,

More information

Real Time Value Iteration and the State-Action Value Function

Real Time Value Iteration and the State-Action Value Function MS&E338 Reinforcement Learning Lecture 3-4/9/18 Real Time Value Iteration and the State-Action Value Function Lecturer: Ben Van Roy Scribe: Apoorva Sharma and Tong Mu 1 Review Last time we left off discussing

More information

Jun Ma and Warren B. Powell Department of Operations Research and Financial Engineering Princeton University, Princeton, NJ 08544

Jun Ma and Warren B. Powell Department of Operations Research and Financial Engineering Princeton University, Princeton, NJ 08544 Convergence Analysis of Kernel-based On-policy Approximate Policy Iteration Algorithms for Markov Decision Processes with Continuous, Multidimensional States and Actions Jun Ma and Warren B. Powell Department

More information

Lecture 3: Markov Decision Processes

Lecture 3: Markov Decision Processes Lecture 3: Markov Decision Processes Joseph Modayil 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs Markov Processes Introduction Introduction to MDPs Markov

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

Abstract Dynamic Programming

Abstract Dynamic Programming Abstract Dynamic Programming Dimitri P. Bertsekas Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology Overview of the Research Monograph Abstract Dynamic Programming"

More information

Approximate dynamic programming for stochastic reachability

Approximate dynamic programming for stochastic reachability Approximate dynamic programming for stochastic reachability Nikolaos Kariotoglou, Sean Summers, Tyler Summers, Maryam Kamgarpour and John Lygeros Abstract In this work we illustrate how approximate dynamic

More information

Notes on Tabular Methods

Notes on Tabular Methods Notes on Tabular ethods Nan Jiang September 28, 208 Overview of the methods. Tabular certainty-equivalence Certainty-equivalence is a model-based RL algorithm, that is, it first estimates an DP model from

More information

Convergence of Synchronous Reinforcement Learning. with linear function approximation

Convergence of Synchronous Reinforcement Learning. with linear function approximation Convergence of Synchronous Reinforcement Learning with Linear Function Approximation Artur Merke artur.merke@udo.edu Lehrstuhl Informatik, University of Dortmund, 44227 Dortmund, Germany Ralf Schoknecht

More information

Value and Policy Iteration

Value and Policy Iteration Chapter 7 Value and Policy Iteration 1 For infinite horizon problems, we need to replace our basic computational tool, the DP algorithm, which we used to compute the optimal cost and policy for finite

More information

Least-Squares Methods for Policy Iteration

Least-Squares Methods for Policy Iteration Least-Squares Methods for Policy Iteration Lucian Buşoniu, Alessandro Lazaric, Mohammad Ghavamzadeh, Rémi Munos, Robert Babuška, and Bart De Schutter Abstract Approximate reinforcement learning deals with

More information

Reinforcement Learning and NLP

Reinforcement Learning and NLP 1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value

More information

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the

More information

ECE276B: Planning & Learning in Robotics Lecture 16: Model-free Control

ECE276B: Planning & Learning in Robotics Lecture 16: Model-free Control ECE276B: Planning & Learning in Robotics Lecture 16: Model-free Control Lecturer: Nikolay Atanasov: natanasov@ucsd.edu Teaching Assistants: Tianyu Wang: tiw161@eng.ucsd.edu Yongxi Lu: yol070@eng.ucsd.edu

More information

Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path

Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path Machine Learning manuscript No. will be inserted by the editor Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path András Antos 1, Csaba

More information

Fitted Q-iteration in continuous action-space MDPs

Fitted Q-iteration in continuous action-space MDPs Fitted Q-iteration in continuous action-space MDPs András Antos Computer and Automation Research Inst of the Hungarian Academy of Sciences Kende u 13-17, Budapest 1111, Hungary antos@sztakihu Rémi Munos

More information

Introduction to Reinforcement Learning Part 1: Markov Decision Processes

Introduction to Reinforcement Learning Part 1: Markov Decision Processes Introduction to Reinforcement Learning Part 1: Markov Decision Processes Rowan McAllister Reinforcement Learning Reading Group 8 April 2015 Note I ve created these slides whilst following Algorithms for

More information

Markov Decision Processes and Dynamic Programming

Markov Decision Processes and Dynamic Programming Master MVA: Reinforcement Learning Lecture: 2 Markov Decision Processes and Dnamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

Multi-armed bandit models: a tutorial

Multi-armed bandit models: a tutorial Multi-armed bandit models: a tutorial CERMICS seminar, March 30th, 2016 Multi-Armed Bandit model: general setting K arms: for a {1,..., K}, (X a,t ) t N is a stochastic process. (unknown distributions)

More information

PART A and ONE question from PART B; or ONE question from PART A and TWO questions from PART B.

PART A and ONE question from PART B; or ONE question from PART A and TWO questions from PART B. Advanced Topics in Machine Learning, GI13, 2010/11 Advanced Topics in Machine Learning, GI13, 2010/11 Answer any THREE questions. Each question is worth 20 marks. Use separate answer books Answer any THREE

More information

MS&E338 Reinforcement Learning Lecture 1 - April 2, Introduction

MS&E338 Reinforcement Learning Lecture 1 - April 2, Introduction MS&E338 Reinforcement Learning Lecture 1 - April 2, 2018 Introduction Lecturer: Ben Van Roy Scribe: Gabriel Maher 1 Reinforcement Learning Introduction In reinforcement learning (RL) we consider an agent

More information

Generalization theory

Generalization theory Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1

More information

Least-Squares Temporal Difference Learning based on Extreme Learning Machine

Least-Squares Temporal Difference Learning based on Extreme Learning Machine Least-Squares Temporal Difference Learning based on Extreme Learning Machine Pablo Escandell-Montero, José M. Martínez-Martínez, José D. Martín-Guerrero, Emilio Soria-Olivas, Juan Gómez-Sanchis IDAL, Intelligent

More information

Approximate Dynamic Programming By Minimizing Distributionally Robust Bounds

Approximate Dynamic Programming By Minimizing Distributionally Robust Bounds Approximate Dynamic Programming By Minimizing Distributionally Robust Bounds Marek Petrik IBM T.J. Watson Research Center, Yorktown, NY, USA Abstract Approximate dynamic programming is a popular method

More information

The convergence limit of the temporal difference learning

The convergence limit of the temporal difference learning The convergence limit of the temporal difference learning Ryosuke Nomura the University of Tokyo September 3, 2013 1 Outline Reinforcement Learning Convergence limit Construction of the feature vector

More information

Reinforcement Learning and Optimal Control. ASU, CSE 691, Winter 2019

Reinforcement Learning and Optimal Control. ASU, CSE 691, Winter 2019 Reinforcement Learning and Optimal Control ASU, CSE 691, Winter 2019 Dimitri P. Bertsekas dimitrib@mit.edu Lecture 8 Bertsekas Reinforcement Learning 1 / 21 Outline 1 Review of Infinite Horizon Problems

More information

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods

More information

Review: Support vector machines. Machine learning techniques and image analysis

Review: Support vector machines. Machine learning techniques and image analysis Review: Support vector machines Review: Support vector machines Margin optimization min (w,w 0 ) 1 2 w 2 subject to y i (w 0 + w T x i ) 1 0, i = 1,..., n. Review: Support vector machines Margin optimization

More information

Lecture 7: Kernels for Classification and Regression

Lecture 7: Kernels for Classification and Regression Lecture 7: Kernels for Classification and Regression CS 194-10, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 15, 2011 Outline Outline A linear regression problem Linear auto-regressive

More information

Reinforcement Learning: An Introduction

Reinforcement Learning: An Introduction Introduction Betreuer: Freek Stulp Hauptseminar Intelligente Autonome Systeme (WiSe 04/05) Forschungs- und Lehreinheit Informatik IX Technische Universität München November 24, 2004 Introduction What is

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

CMU Lecture 11: Markov Decision Processes II. Teacher: Gianni A. Di Caro

CMU Lecture 11: Markov Decision Processes II. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 11: Markov Decision Processes II Teacher: Gianni A. Di Caro RECAP: DEFINING MDPS Markov decision processes: o Set of states S o Start state s 0 o Set of actions A o Transitions P(s s,a)

More information

1 Stochastic Dynamic Programming

1 Stochastic Dynamic Programming 1 Stochastic Dynamic Programming Formally, a stochastic dynamic program has the same components as a deterministic one; the only modification is to the state transition equation. When events in the future

More information

Using first-order logic, formalize the following knowledge:

Using first-order logic, formalize the following knowledge: Probabilistic Artificial Intelligence Final Exam Feb 2, 2016 Time limit: 120 minutes Number of pages: 19 Total points: 100 You can use the back of the pages if you run out of space. Collaboration on the

More information

Some AI Planning Problems

Some AI Planning Problems Course Logistics CS533: Intelligent Agents and Decision Making M, W, F: 1:00 1:50 Instructor: Alan Fern (KEC2071) Office hours: by appointment (see me after class or send email) Emailing me: include CS533

More information

Geometric Variance Reduction in MarkovChains. Application tovalue Function and Gradient Estimation

Geometric Variance Reduction in MarkovChains. Application tovalue Function and Gradient Estimation Geometric Variance Reduction in arkovchains. Application tovalue Function and Gradient Estimation Rémi unos Centre de athématiques Appliquées, Ecole Polytechnique, 98 Palaiseau Cedex, France. remi.munos@polytechnique.fr

More information

Journal of Computational and Applied Mathematics. Projected equation methods for approximate solution of large linear systems

Journal of Computational and Applied Mathematics. Projected equation methods for approximate solution of large linear systems Journal of Computational and Applied Mathematics 227 2009) 27 50 Contents lists available at ScienceDirect Journal of Computational and Applied Mathematics journal homepage: www.elsevier.com/locate/cam

More information