Approximate Dynamic Programming


1 Approximate Dynamic Programming. A. LAZARIC (SequeL Team, INRIA Lille). ENS Cachan - Master 2 MVA, MVA-RL Course

2 Approximate Dynamic Programming (a.k.a. Batch Reinforcement Learning)

3 Approximate Dynamic Programming (a.k.a. Batch Reinforcement Learning): Approximate Value Iteration, Approximate Policy Iteration

4-6 From DP to ADP Dynamic programming algorithms require an explicit definition of the transition probabilities p(· | x, a) and of the reward function r(x, a). This knowledge is often unavailable (e.g., wind intensity, human-computer interaction). Can we rely on samples?

7-9 From DP to ADP Dynamic programming algorithms require an exact representation of value functions and policies. This is often impossible since their shape is too complicated (e.g., large or continuous state space). Can we use approximations?

10 The Objective Find a policy π such that the performance loss ||V* − V^π|| is as small as possible.

11-13 From Approximation Error to Performance Loss Question: if Ṽ is an approximation of the optimal value function V* with an error = ||V* − Ṽ||, how does it translate to the (loss of) performance of the greedy policy

π(x) ∈ arg max_{a ∈ A} Σ_y p(y | x, a) [ r(x, a, y) + γ Ṽ(y) ],

i.e., the performance loss ||V* − V^π||?

14 From Approximation Error to Performance Loss Proposition. Let Ṽ ∈ R^N be an approximation of V* and π its corresponding greedy policy. Then

||V* − V^π||_∞ (performance loss) ≤ 2γ/(1 − γ) ||V* − Ṽ||_∞ (approximation error).

Furthermore, there exists ɛ > 0 such that if ||V* − Ṽ||_∞ ≤ ɛ, then π is optimal.

15 From Approximation Error to Performance Loss Proof. Using T^π Ṽ = T Ṽ (since π is greedy w.r.t. Ṽ) and V^π = T^π V^π:

||V* − V^π||_∞ ≤ ||T V* − T^π Ṽ||_∞ + ||T^π Ṽ − T^π V^π||_∞
             ≤ ||T V* − T Ṽ||_∞ + γ ||Ṽ − V^π||_∞
             ≤ γ ||V* − Ṽ||_∞ + γ ( ||Ṽ − V*||_∞ + ||V* − V^π||_∞ ),

and rearranging gives ||V* − V^π||_∞ ≤ 2γ/(1 − γ) ||V* − Ṽ||_∞.

16 Approximate Dynamic Programming (a.k.a. Batch Reinforcement Learning): Approximate Value Iteration, Approximate Policy Iteration

17-19 From Approximation Error to Performance Loss Question: how do we compute a good Ṽ? Problem: unlike in standard approximation scenarios (see supervised learning), we have only limited access to the target function, i.e., V*. Solution: value iteration tends to learn functions which are close to the optimal value function V*.

20-21 Value Iteration: the Idea
1. Let Q_0 be any action-value function
2. At each iteration k = 1, 2, ..., K compute
   Q_{k+1}(x, a) = T Q_k(x, a) = r(x, a) + γ Σ_y p(y | x, a) max_b Q_k(y, b)
3. Return the greedy policy
   π_K(x) ∈ arg max_{a ∈ A} Q_K(x, a).

Problem: how can we approximate T Q_k? Problem: if Q_{k+1} ≠ T Q_k, does (approximate) value iteration still work?
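For reference before introducing approximation, here is a minimal sketch of exact Q-value iteration for a small finite MDP, assuming the model is given explicitly as arrays P and r (these array names are illustrative, not part of the slides):

```python
import numpy as np

def q_value_iteration(P, r, gamma, K):
    """Tabular Q-value iteration.

    P: array of shape (S, A, S) with P[x, a, y] = p(y | x, a)
    r: array of shape (S, A) with r[x, a] = expected reward
    gamma: discount factor, K: number of iterations.
    Returns Q_K and its greedy policy.
    """
    S, A = r.shape
    Q = np.zeros((S, A))             # Q_0: any action-value function
    for _ in range(K):
        V = Q.max(axis=1)            # max_b Q_k(y, b) for every state y
        Q = r + gamma * P.dot(V)     # (T Q_k)(x, a)
    policy = Q.argmax(axis=1)        # greedy policy pi_K
    return Q, policy
```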

22-23 Linear Fitted Q-iteration: the Approximation Space Linear space (used to approximate action-value functions)

F = { f_α(x, a) = Σ_{j=1}^d α_j ϕ_j(x, a), α ∈ R^d }

with features ϕ_j : X × A → [0, L] and feature vector φ(x, a) = [ϕ_1(x, a), ..., ϕ_d(x, a)]^T.

24 Linear Fitted Q-iteration: the Samples Assumption: access to a generative model, that is, a black-box simulator sim() of the environment is available. Given (x, a), sim(x, a) = (y, r), with y ∼ p(· | x, a) and r = r(x, a).

25-34 Linear Fitted Q-iteration
Input: space F, number of iterations K, sampling distribution ρ, number of samples n
Initial function Q_0 ∈ F
For k = 1, ..., K
  1. Draw n samples (x_i, a_i) i.i.d. from ρ
  2. Sample x'_i ∼ p(· | x_i, a_i) and r_i = r(x_i, a_i)
  3. Compute y_i = r_i + γ max_a Q_{k−1}(x'_i, a)
  4. Build the training set { ((x_i, a_i), y_i) }_{i=1}^n
  5. Solve the least-squares problem
     f_{α̂_k} = arg min_{f_α ∈ F} (1/n) Σ_{i=1}^n ( f_α(x_i, a_i) − y_i )^2
  6. Return Q_k = f_{α̂_k} (truncation may be needed)
Return the greedy policy π_K(·) = arg max_a Q_K(·, a) (a code sketch follows below)
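A minimal runnable sketch of this loop, assuming a generative model sim(x, a) returning (next_state, reward), a feature map phi(x, a) returning a d-dimensional numpy array, and a sampler sample_rho() for the (x_i, a_i) pairs; all of these names are illustrative placeholders, not part of the original slides:

```python
import numpy as np

def linear_fqi(sim, phi, sample_rho, actions, d, gamma, K, n, v_max):
    """Linear Fitted Q-iteration (sketch). Returns the last weight vector alpha_K."""
    alpha = np.zeros(d)                                   # Q_0 = 0 in F

    def Q(x, a, w):
        # Truncate to [-V_max, V_max], as suggested on the slides.
        return float(np.clip(phi(x, a) @ w, -v_max, v_max))

    for _ in range(K):
        xs_as = [sample_rho() for _ in range(n)]          # (x_i, a_i) ~ rho
        Phi = np.array([phi(x, a) for x, a in xs_as])     # n x d design matrix
        y = np.empty(n)
        for i, (x, a) in enumerate(xs_as):
            x_next, r = sim(x, a)                         # one call to the generative model
            y[i] = r + gamma * max(Q(x_next, b, alpha) for b in actions)
        # Least-squares fit of the regression targets y_i (approximates T Q_{k-1})
        alpha, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return alpha
```

The greedy policy π_K is then obtained, state by state, by maximizing phi(x, a) @ alpha over the action set.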

35-36 Linear Fitted Q-iteration: Sampling
1. Draw n samples (x_i, a_i) i.i.d. from ρ
2. Sample x'_i ∼ p(· | x_i, a_i) and r_i = r(x_i, a_i)

In practice this can be done once, before running the algorithm. The sampling distribution ρ should cover the state-action space in all relevant regions. If it is not possible to choose ρ, a database of samples can be used.

37-38 Linear Fitted Q-iteration: The Training Set
3. Compute y_i = r_i + γ max_a Q_{k−1}(x'_i, a)
4. Build the training set { ((x_i, a_i), y_i) }_{i=1}^n

Each y_i is an unbiased sample of T Q_{k−1}(x_i, a_i), since

E[y_i | x_i, a_i] = E[ r_i + γ max_a Q_{k−1}(x'_i, a) ] = r(x_i, a_i) + γ E[ max_a Q_{k−1}(x'_i, a) ]
                 = r(x_i, a_i) + γ ∫_X max_a Q_{k−1}(y, a) p(dy | x_i, a_i) = T Q_{k−1}(x_i, a_i).

The problem reduces to standard regression. The training set must be recomputed at each iteration.

39-40 Linear Fitted Q-iteration: The Regression Problem
5. Solve the least-squares problem
   f_{α̂_k} = arg min_{f_α ∈ F} (1/n) Σ_{i=1}^n ( f_α(x_i, a_i) − y_i )^2
6. Return Q_k = f_{α̂_k} (truncation may be needed)

Thanks to the linear space we can solve it in closed form:
- Build the feature matrix Φ = [ φ(x_1, a_1)^T ; ... ; φ(x_n, a_n)^T ]
- Compute α̂_k = (Φ^T Φ)^{−1} Φ^T y (least-squares solution)
- Truncate to [−V_max, V_max] (with V_max = R_max / (1 − γ))
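A short numpy sketch of this closed-form step; Phi, y, and v_max are as in the sketch above, and the optional ridge term is a common practical safeguard rather than something stated on the slide:

```python
import numpy as np

def fit_q_weights(Phi, y, v_max, ridge=0.0):
    """Closed-form least squares: alpha = (Phi^T Phi)^{-1} Phi^T y, then truncation."""
    d = Phi.shape[1]
    A = Phi.T @ Phi + ridge * np.eye(d)            # Gram matrix (optionally regularized)
    alpha = np.linalg.solve(A, Phi.T @ y)          # least-squares solution
    q_hat = np.clip(Phi @ alpha, -v_max, v_max)    # truncated predictions on the sample
    return alpha, q_hat
```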

41 Sketch of the Analysis [Diagram: at each iteration, Q_k approximates T Q_{k−1} up to an error ɛ_k; these per-iteration errors accumulate over the K iterations into a final error between Q* and Q^{π_K}, the value of the greedy policy π_K.]

42-44 Theoretical Objectives
Objective: derive a bound on the performance (quadratic) loss w.r.t. a testing distribution µ: ||Q* − Q^{π_K}||_µ ≤ ???
Sub-Objective 1: derive an intermediate bound on the prediction error at any iteration k w.r.t. the sampling distribution ρ: ||T Q_{k−1} − Q_k||_ρ ≤ ???
Sub-Objective 2: analyze how the error at each iteration is propagated through iterations: ||Q* − Q^{π_K}||_µ ≤ propagation( ||T Q_{k−1} − Q_k||_ρ )

45-49 The Sources of Error
Desired solution at iteration k: the exact backup T Q_{k−1}
Best solution in F (w.r.t. the sampling distribution ρ): f_{α*_k} = arg inf_{f_α ∈ F} ||f_α − T Q_{k−1}||_ρ  (error due to the approximation space F)
Returned solution: Q_k = f_{α̂_k} = arg min_{f_α ∈ F} (1/n) Σ_{i=1}^n ( f_α(x_i, a_i) − y_i )^2  (additional error due to the (random) samples)

50 Per-Iteration Error Theorem. At each iteration k, Linear-FQI returns an approximation Q_k such that (Sub-Objective 1)

||T Q_{k−1} − Q_k||_ρ ≤ 4 ||T Q_{k−1} − f_{α*_k}||_ρ + O( (V_max + L ||α*_k||) sqrt( log(1/δ) / n ) ) + O( V_max sqrt( d log(n/δ) / n ) )

with probability 1 − δ. Tools: concentration-of-measure inequalities, covering numbers, linear algebra, union bounds, special tricks for linear spaces, ...

51-54 Per-Iteration Error: Remarks
- 4 ||T Q_{k−1} − f_{α*_k}||_ρ : approximation error; no algorithm can do better (up to the constant 4); it depends on the space F and changes with the iteration k.
- O( (V_max + L ||α*_k||) sqrt( log(1/δ) / n ) ) : vanishes as O(n^{−1/2}); depends on the features (L) and on the best solution (||α*_k||).
- O( V_max sqrt( d log(n/δ) / n ) ) : vanishes as O(n^{−1/2}); depends on the dimensionality of the space (d) and the number of samples (n).

55-58 Error Propagation
Objective: ||Q* − Q^{π_K}||_µ
Problem 1: the test norm µ is different from the sampling norm ρ.
Problem 2: we have bounds for Q_k, not for the performance of the corresponding policy π_k.
Problem 3: we have bounds for one single iteration.

59-60 Error Propagation Transition kernel for a fixed policy: P^π. m-step (worst-case) concentration of the future state distribution:

c(m) = sup_{π_1, ..., π_m} || d(µ P^{π_1} ... P^{π_m}) / dρ ||_∞ < ∞

Average (discounted) concentration:

C_{µ,ρ} = (1 − γ)^2 Σ_{m ≥ 1} m γ^{m−1} c(m) < +∞

61 Error Propagation Remark: relationship to the top-Lyapunov exponent

L^+ = sup_π lim sup_{m → ∞} (1/m) log^+ ( || ρ P^{π_1} P^{π_2} ... P^{π_m} || ).

If L^+ ≤ 0 (stable system), then c(m) has a polynomial growth rate and C_{µ,ρ} is finite.

62 Error Propagation Proposition. Let ɛ_k = T Q_{k−1} − Q_k be the error made at each iteration. Then after K iterations the performance loss of the greedy policy π_K satisfies

||Q* − Q^{π_K}||_µ^2 ≤ [ 2γ / (1 − γ)^2 ]^2 C_{µ,ρ} max_k ||ɛ_k||_ρ^2 + O( γ^K / (1 − γ)^3 V_max^2 )

63-64 The Final Bound Bringing everything together: plug the per-iteration bound

||ɛ_k||_ρ = ||T Q_{k−1} − Q_k||_ρ ≤ 4 ||T Q_{k−1} − f_{α*_k}||_ρ + O( (V_max + L ||α*_k||) sqrt( log(1/δ) / n ) ) + O( V_max sqrt( d log(n/δ) / n ) )

into the propagation inequality above.

65 The Final Bound Theorem (see e.g., Munos, 03). LinearFQI with a space F of d features and n samples at each iteration returns a policy π_K after K iterations such that

||Q* − Q^{π_K}||_µ ≤ 2γ / (1 − γ)^2 sqrt(C_{µ,ρ}) [ 4 d(F, T F) + O( V_max (1 + L / sqrt(ω)) sqrt( d log(n/δ) / n ) ) ] + O( γ^K / (1 − γ)^3 V_max^2 )

66-67, 69-72 The Final Bound: Remarks
- The propagation (and the change of norms) makes the problem more complex: how do we choose the sampling distribution ρ?
- The approximation error d(F, T F) is worse than in regression.
- The dependency on γ is worse than at each single iteration: is it possible to avoid it?
- The error decreases exponentially in K: roughly K ≈ log(1/ɛ)/(1 − γ) iterations suffice to bring the γ^K term below ɛ.
- ω is the smallest eigenvalue of the Gram matrix: design the features so as to be (close to) orthogonal w.r.t. ρ.
- The asymptotic rate O(sqrt(d/n)) is the same as for regression.

68 The Final Bound: the inherent Bellman error

||T Q_{k−1} − f_{α*_k}||_ρ = inf_{f ∈ F} ||T Q_{k−1} − f||_ρ ≤ sup_{g ∈ F} inf_{f ∈ F} ||T g − f||_ρ = d(F, T F),

since Q_{k−1} ∈ F (up to truncation). Question: how do we design F to make it compatible with the Bellman operator T?

73 Summary: what the final bound on ||Q* − Q^{π_K}||_µ depends on
- Approximation algorithm and dynamic programming scheme: the per-iteration error and its propagation
- Samples: sampling strategy, number n, sampling distribution ρ
- Markov decision process: concentrability C_{µ,ρ}, range V_max
- Approximation space: inherent Bellman error d(F, T F), size d, features (L, ω)

74 Other Implementations Replace the regression step with: K-nearest neighbours, regularized linear regression with ℓ1 or ℓ2 penalties, neural networks, support vector regression, ...

75-79 Example: the Optimal Replacement Problem
State: level of wear of an object (e.g., a car).
Action: {(R)eplace, (K)eep}.
Cost: c(x, R) = C (replacement cost); c(x, K) = c(x), the maintenance plus extra costs.
Dynamics: p(· | x, R) = exp(β), i.e., the next state has density d(y) = β exp(−βy) I{y ≥ 0}; p(· | x, K) = x + exp(β), i.e., the next state has density d(y − x) for y ≥ x.
Problem: minimize the discounted expected cost over an infinite horizon.
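A small sketch of what a generative model sim(x, a) for this problem could look like under the dynamics and costs above; the maintenance-cost function and the numeric constants are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
BETA, REPLACE_COST = 0.6, 30.0          # illustrative parameters

def maintenance_cost(x):                # placeholder for c(x): grows with the wear level
    return 4.0 * x

def sim(x, action):
    """Generative model: returns (next_state, cost) for the replacement problem."""
    if action == "R":                   # replace: wear resets, next state ~ exp(beta)
        return rng.exponential(1.0 / BETA), REPLACE_COST
    else:                               # keep: wear accumulates, next state = x + exp(beta)
        return x + rng.exponential(1.0 / BETA), maintenance_cost(x)
```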

80-81 Example: the Optimal Replacement Problem Optimal value function

V*(x) = min{ c(x) + γ ∫_0^∞ d(y − x) V*(y) dy,  C + γ ∫_0^∞ d(y) V*(y) dy }

Optimal policy: the action that attains the minimum.

82-83 Example: the Optimal Replacement Problem [Figures: management cost c(x) and optimal value function V*(x) as functions of the wear x, with the (K)eep / (R)eplace regions of the optimal policy.] Linear approximation space F := { V_α(x) = Σ_{k=1}^{20} α_k cos(kπ x / x_max) }.
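A minimal sketch of this cosine feature map, with x_max standing for the assumed maximum wear level:

```python
import numpy as np

def cosine_features(x, d=20, x_max=10.0):
    """Feature vector [cos(k*pi*x/x_max)] for k = 1..d, spanning the linear space F."""
    k = np.arange(1, d + 1)
    return np.cos(k * np.pi * x / x_max)

# A value-function approximation in F is then V_alpha(x) = cosine_features(x) @ alpha.
```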

84 Example: the Optimal Replacement Problem Collect N samples on a uniform grid.

85 Example: the Optimal Replacement Problem Figure: Left: the target values computed as {T V_0(x_n)}_{1 ≤ n ≤ N}. Right: the approximation V_1 ∈ F of the target function T V_0.

86 Example: the Optimal Replacement Problem Figure: Left: the target values computed as {T V_1(x_n)}_{1 ≤ n ≤ N}. Center: the approximation V_2 ∈ F of T V_1. Right: the approximation V_n ∈ F after n iterations.

87 Example: the Optimal Replacement Problem Simulation

88 Approximate Dynamic Programming (a.k.a. Batch Reinforcement Learning): Approximate Value Iteration, Approximate Policy Iteration

89-90 Policy Iteration: the Idea
1. Let π_0 be any stationary policy
2. At each iteration k = 1, 2, ..., K
   Policy evaluation: given π_k, compute V_k = V^{π_k}.
   Policy improvement: compute the greedy policy
   π_{k+1}(x) ∈ arg max_{a ∈ A} [ r(x, a) + γ Σ_y p(y | x, a) V^{π_k}(y) ].
3. Return the last policy π_K

Problem: how can we approximate V^{π_k}? Problem: if V_k ≠ V^{π_k}, does (approximate) policy iteration still work?

91 Approximate Policy Iteration: Performance Loss Problem: the algorithm is no longer guaranteed to converge. [Figure: ||V* − V^{π_k}|| oscillates within an asymptotic error band as k grows.]

Proposition. The asymptotic performance of the policies π_k generated by the API algorithm is related to the approximation error as:

lim sup_{k → ∞} ||V* − V^{π_k}||_∞ (performance loss) ≤ 2γ / (1 − γ)^2 lim sup_{k → ∞} ||V_k − V^{π_k}||_∞ (approximation error)

92-93 Least-Squares Policy Iteration (LSPI) LSPI uses
- a linear space to approximate value functions*: F = { f(x) = Σ_{j=1}^d α_j ϕ_j(x), α ∈ R^d }
- the Least-Squares Temporal Difference (LSTD) algorithm for policy evaluation.
*In practice we use approximations of action-value functions.

94 Least-Squares Temporal-Difference Learning (LSTD) V^π may not belong to F (V^π ∉ F). The best approximation of V^π in F is its projection ΠV^π = arg min_{f ∈ F} ||V^π − f|| (Π is the projection onto F). [Figure: V^π outside F and its projection ΠV^π onto F.]

95 Least-Squares Temporal-Difference Learning (LSTD) V^π is the fixed point of T^π: V^π = T^π V^π = r^π + γ P^π V^π. LSTD searches for the fixed point of Π_{2,ρ} T^π, where Π_{2,ρ} g = arg min_{f ∈ F} ||g − f||_{2,ρ}. When the fixed point of Π_ρ T^π exists, we call it the LSTD solution:

V_TD = Π_ρ T^π V_TD

[Figure: V^π, the Bellman image T^π V_TD, and the LSTD fixed point V_TD = Π_ρ T^π V_TD inside F.]

96-98 Least-Squares Temporal-Difference Learning (LSTD) V_TD = Π_ρ T^π V_TD. The projection Π_ρ is orthogonal in expectation w.r.t. the space F spanned by the features ϕ_1, ..., ϕ_d:

E_{x∼ρ}[ (T^π V_TD(x) − V_TD(x)) ϕ_i(x) ] = 0, for all i ∈ [1, d]   ⟺   ⟨T^π V_TD − V_TD, ϕ_i⟩_ρ = 0

By definition of the Bellman operator,

⟨r^π + γ P^π V_TD − V_TD, ϕ_i⟩_ρ = 0   ⟺   ⟨r^π, ϕ_i⟩_ρ − ⟨(I − γ P^π) V_TD, ϕ_i⟩_ρ = 0

Since V_TD ∈ F, there exists α_TD such that V_TD(x) = φ(x)^T α_TD, hence

⟨r^π, ϕ_i⟩_ρ − Σ_{j=1}^d ⟨(I − γ P^π) ϕ_j, ϕ_i⟩_ρ α_{TD,j} = 0

99 Least-Squares Temporal-Difference Learning (LSTD) V_TD = Π_ρ T^π V_TD. Writing b_i = ⟨r^π, ϕ_i⟩_ρ and A_{i,j} = ⟨(I − γ P^π) ϕ_j, ϕ_i⟩_ρ, the orthogonality conditions become the linear system

A α_TD = b

100-101 Least-Squares Temporal-Difference Learning (LSTD)
Problem: in general, Π_ρ T^π is not a contraction and does not have a fixed point. Solution: if ρ = ρ^π (the stationary distribution of π), then Π_{ρ^π} T^π has a unique fixed point.
Problem: in general, Π_ρ T^π cannot be computed (because P^π and r^π are unknown). Solution: use samples coming from a trajectory of π.

102-108 Least-Squares Policy Iteration (LSPI)
Input: space F, number of iterations K, sampling distribution ρ, number of samples n
Initial policy π_0
For k = 1, ..., K
  1. Generate a trajectory of length n from the stationary distribution ρ^{π_k}:
     (x_1, π_k(x_1), r_1, x_2, π_k(x_2), r_2, ..., x_{n−1}, π_k(x_{n−1}), r_{n−1}, x_n)
  2. Compute the empirical matrix Â_k and the vector b̂_k:
     [Â_k]_{i,j} = (1/n) Σ_{t=1}^{n−1} ϕ_i(x_t) ( ϕ_j(x_t) − γ ϕ_j(x_{t+1}) )  ≈ ⟨(I − γ P^{π_k}) ϕ_j, ϕ_i⟩_{ρ^{π_k}}
     [b̂_k]_i = (1/n) Σ_{t=1}^{n−1} ϕ_i(x_t) r_t  ≈ ⟨r^{π_k}, ϕ_i⟩_{ρ^{π_k}}
  3. Solve the linear system α_k = Â_k^{−1} b̂_k
  4. Compute the greedy policy π_{k+1} w.r.t. V_k = f_{α_k}
Return the last policy π_K (see the LSTD sketch below)
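A minimal numpy sketch of the LSTD step inside this loop (steps 2 and 3), assuming a trajectory given as lists of states and rewards plus a feature map phi; the names are illustrative:

```python
import numpy as np

def lstd(states, rewards, phi, gamma):
    """Empirical LSTD: build A_hat, b_hat from one trajectory and solve A alpha = b.

    states: [x_1, ..., x_n], rewards: [r_1, ..., r_{n-1}], phi: x -> R^d feature map.
    """
    n = len(states)
    d = phi(states[0]).shape[0]
    A_hat = np.zeros((d, d))
    b_hat = np.zeros(d)
    for t in range(n - 1):
        f_t, f_next = phi(states[t]), phi(states[t + 1])
        A_hat += np.outer(f_t, f_t - gamma * f_next)   # estimates <(I - gamma P^pi) phi_j, phi_i>
        b_hat += f_t * rewards[t]                      # estimates <r^pi, phi_i>
    A_hat /= n
    b_hat /= n
    return np.linalg.solve(A_hat, b_hat)               # LSTD weights alpha
```

In LSPI the same construction is applied to state-action features (LSTD-Q), so that the greedy policy π_{k+1}(x) = arg max_a φ(x, a)^T α_k can be computed without knowing the model.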

109 Least-Squares Policy Iteration (LSPI) Remarks on step 1 (generate a trajectory of length n from the stationary distribution ρ^{π_k}): the first few samples may be discarded because they are not actually drawn from the stationary distribution ρ^{π_k}; off-policy samples could be used with importance weighting; in practice, i.i.d. states drawn from an arbitrary distribution (but with actions chosen by π_k) may be used.

110 Least-Squares Policy Iteration (LSPI) Remark on step 4 (compute the greedy policy π_{k+1} w.r.t. V_k = f_{α_k}): computing the greedy policy from V_k is difficult, so we move to LSTD-Q on action-value functions and compute π_{k+1}(x) = arg max_a Q_k(x, a).

111-112 Least-Squares Policy Iteration (LSPI) For k = 1, ..., K, the trajectory in step 1 is generated by π_k and the greedy policy π_{k+1} is computed from it (step 4). Problem: this process may be unstable because π_k does not cover the state space properly.

113-115 LSTD Algorithm When n → ∞, Â → A and b̂ → b, and thus α̂_TD → α_TD and V̂_TD → V_TD.

Proposition (LSTD Performance). If LSTD is used to estimate the value of π with an infinite number of samples drawn from the stationary distribution ρ^π, then

||V^π − V_TD||_{ρ^π} ≤ 1 / sqrt(1 − γ^2) inf_{V ∈ F} ||V^π − V||_{ρ^π}

Problem: we do not have an infinite number of samples. Problem 2: V_TD is a fixed-point solution, not the solution of a standard machine learning (regression) problem.

116-117 LSTD Error Bound Assumption: the Markov chain induced by the policy π_k has a stationary distribution ρ^{π_k} and it is ergodic and β-mixing.

Theorem (LSTD Error Bound). At any iteration k, if LSTD uses n samples obtained from a single trajectory of π_k and a d-dimensional space, then with probability 1 − δ

||V^{π_k} − V_k||_{ρ^{π_k}} ≤ c / sqrt(1 − γ^2) inf_{f ∈ F} ||V^{π_k} − f||_{ρ^{π_k}} + O( sqrt( d log(d/δ) / (n ν) ) )

118-120 LSTD Error Bound

||V^{π_k} − V_k||_{ρ^{π_k}} ≤ c / sqrt(1 − γ^2) inf_{f ∈ F} ||V^{π_k} − f||_{ρ^{π_k}} (approximation error) + O( sqrt( d log(d/δ) / (n ν_k) ) ) (estimation error)

- Approximation error: it depends on how well the function space F can approximate the value function V^{π_k}.
- Estimation error: it depends on the number of samples n, the dimensionality of the function space d, the smallest eigenvalue ν_k of the Gram matrix ( ∫ ϕ_i ϕ_j dρ^{π_k} )_{i,j}, and the mixing properties of the Markov chain (hidden in the O(·)).
- Assumption: the eigenvalues of the Gram matrix are strictly positive (existence of the model-based LSTD solution). The β-mixing coefficients are hidden in the O(·) notation.

121-124 LSPI Error Bound Theorem (LSPI Error Bound). If LSPI is run over K iterations, then the performance loss of the policy π_K is

||V* − V^{π_K}||_µ ≤ 4γ / (1 − γ)^2 { sqrt(C C_{µ,ρ}) [ c E_0(F) + O( sqrt( d log(dK/δ) / (n ν_ρ) ) ) ] + γ^K R_max }

with probability 1 − δ.
- Approximation error: E_0(F) = sup_{π ∈ G(F̃)} inf_{f ∈ F} ||V^π − f||_{ρ^π}
- Estimation error: depends on n, d, ν_ρ, K
- Initialization error: the γ^K R_max term accounts for the error ||V* − V^{π_0}|| due to the choice of the initial value function or initial policy.

125-126 LSPI Error Bound Lower-Bounding Distribution: there exists a distribution ρ such that for any policy π ∈ G(F̃), we have ρ ≤ C ρ^π, where C < ∞ is a constant and ρ^π is the stationary distribution of π. Furthermore, we can define the concentrability coefficient C_{µ,ρ} as before. ν_ρ is the smallest eigenvalue of the Gram matrix ( ∫ ϕ_i ϕ_j dρ )_{i,j}.

127 Bellman Residual Minimization (BRM): the Idea [Figure: V^π, the space F, and V_BR, the element of F whose Bellman image T^π V_BR is closest to V_BR itself.] Let µ be a distribution over X; V_BR is the minimizer of the Bellman residual w.r.t. T^π:

V_BR = arg min_{V ∈ F} ||T^π V − V||_{2,µ}

128 Bellman Residual Minimization (BRM): the Idea The mapping α ↦ T^π V_α − V_α is affine, so the function α ↦ ||T^π V_α − V_α||_µ^2 is quadratic. The minimum is obtained by computing the gradient and setting it to zero:

⟨ r^π + (γ P^π − I) Σ_{j=1}^d φ_j α_j, (γ P^π − I) φ_i ⟩_µ = 0,

which can be rewritten as A α = b, with A_{i,j} = ⟨ φ_i − γ P^π φ_i, φ_j − γ P^π φ_j ⟩_µ and b_i = ⟨ φ_i − γ P^π φ_i, r^π ⟩_µ.

129-130 Bellman Residual Minimization (BRM): the Idea Remark: the system admits a solution whenever the features φ_i are linearly independent w.r.t. µ. Remark: letting ψ_i = φ_i − γ P^π φ_i for i = 1, ..., d, the previous system can be interpreted as a linear regression problem: find α such that α^T ψ ≈ r^π in ||·||_µ.

131 BRM: the Approximation Error Proposition. We have

||V^π − V_BR|| ≤ ||(I − γ P^π)^{−1}|| (1 + γ ||P^π||) inf_{V ∈ F} ||V^π − V||.

If µ^π is the stationary distribution of π, then ||P^π||_{µ^π} = 1 and ||(I − γ P^π)^{−1}||_{µ^π} ≤ 1/(1 − γ), thus

||V^π − V_BR||_{µ^π} ≤ (1 + γ)/(1 − γ) inf_{V ∈ F} ||V^π − V||_{µ^π}.

132 BRM: the Implementation Assumption: a generative model is available. Draw n states X_t ∼ µ, call the generative model on (X_t, A_t) (with A_t = π(X_t)) and obtain R_t = r(X_t, A_t) and Y_t ∼ p(· | X_t, A_t). Compute the empirical Bellman residual

B̂(V) = (1/n) Σ_{t=1}^n [ V(X_t) − ( R_t + γ V(Y_t) ) ]^2,

where R_t + γ V(Y_t) is the sampled backup T̂ V(X_t).

133 BRM: the Implementation Problem: this estimator is biased and not consistent! In fact,

E[B̂(V)] = E[ ( V(X_t) − T^π V(X_t) + T^π V(X_t) − T̂ V(X_t) )^2 ] = ||T^π V − V||_µ^2 + E[ ( T^π V(X_t) − T̂ V(X_t) )^2 ],

so minimizing B̂(V) does not correspond to minimizing B(V) (even when n → ∞), because of the extra variance term.

134 BRM: the Implementation Solution: in each state X_t, generate two independent next-state samples Y_t and Y'_t ∼ p(· | X_t, A_t) and define

B̂(V) = (1/n) Σ_{t=1}^n [ V(X_t) − ( R_t + γ V(Y_t) ) ] [ V(X_t) − ( R_t + γ V(Y'_t) ) ].

Then B̂ → B as n → ∞.

135 BRM: the Implementation The function α ↦ B̂(V_α) is quadratic and we obtain the linear system

Â_{i,j} = (1/n) Σ_{t=1}^n [ φ_i(X_t) − γ φ_i(Y_t) ] [ φ_j(X_t) − γ φ_j(Y'_t) ],
b̂_i = (1/n) Σ_{t=1}^n [ φ_i(X_t) − γ ( φ_i(Y_t) + φ_i(Y'_t) ) / 2 ] R_t.
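A small numpy sketch of this double-sample construction, assuming arrays X (states), R (rewards), and two independent next-state arrays Y and Y2, plus a feature map phi; the names are illustrative:

```python
import numpy as np

def brm_system(X, R, Y, Y2, phi, gamma):
    """Build the BRM linear system A alpha = b from double next-state samples and solve it."""
    F_x = np.array([phi(x) for x in X])      # phi(X_t), shape (n, d)
    F_y = np.array([phi(y) for y in Y])      # phi(Y_t)
    F_y2 = np.array([phi(y) for y in Y2])    # phi(Y'_t), independent copy
    n = len(X)
    Psi1 = F_x - gamma * F_y                 # phi_i(X_t) - gamma phi_i(Y_t)
    Psi2 = F_x - gamma * F_y2                # phi_j(X_t) - gamma phi_j(Y'_t)
    A_hat = Psi1.T @ Psi2 / n                # estimate of A_{i,j}
    b_hat = (F_x - gamma * (F_y + F_y2) / 2).T @ np.asarray(R) / n
    return np.linalg.solve(A_hat, b_hat)     # alpha_BR
```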

136-137 BRM: the Approximation Error Proof. We relate the Bellman residual to the approximation error:

V^π − V = V^π − T^π V + T^π V − V = γ P^π (V^π − V) + T^π V − V   ⟹   (I − γ P^π)(V^π − V) = T^π V − V.

Taking norms on both sides, ||V^π − V_BR|| ≤ ||(I − γ P^π)^{−1}|| ||T^π V_BR − V_BR||, and

||T^π V_BR − V_BR|| = inf_{V ∈ F} ||T^π V − V|| ≤ (1 + γ ||P^π||) inf_{V ∈ F} ||V^π − V||.

If we consider the stationary distribution µ^π, then ||P^π||_{µ^π} = 1 and (I − γ P^π)^{−1} can be written as the power series Σ_t γ^t (P^π)^t; applying the norm we obtain

||(I − γ P^π)^{−1}||_{µ^π} ≤ Σ_{t ≥ 0} γ^t ||P^π||^t_{µ^π} = 1/(1 − γ).

138 LSTD vs BRM Different assumptions: BRM requires a generative model (two independent next states per sampled state), LSTD only a single trajectory. The performance is evaluated differently: BRM w.r.t. any distribution µ, LSTD w.r.t. the stationary distribution of π.

139 Bibliography I

140 Reinforcement Learning Alessandro Lazaric sequel.lille.inria.fr


More information

Linear Least-squares Dyna-style Planning

Linear Least-squares Dyna-style Planning Linear Least-squares Dyna-style Planning Hengshuai Yao Department of Computing Science University of Alberta Edmonton, AB, Canada T6G2E8 hengshua@cs.ualberta.ca Abstract World model is very important for

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Based on [Kaelbling et al., 1996, Bertsekas, 2000] Bert Kappen Reinforcement learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error

More information

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)

More information

Confidence intervals for the mixing time of a reversible Markov chain from a single sample path

Confidence intervals for the mixing time of a reversible Markov chain from a single sample path Confidence intervals for the mixing time of a reversible Markov chain from a single sample path Daniel Hsu Aryeh Kontorovich Csaba Szepesvári Columbia University, Ben-Gurion University, University of Alberta

More information

An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs

An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs 2015 IEEE 54th Annual Conference on Decision and Control CDC December 15-18, 2015. Osaka, Japan An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs Abhishek Gupta Rahul Jain Peter

More information

Reinforcement Learning with Function Approximation. Joseph Christian G. Noel

Reinforcement Learning with Function Approximation. Joseph Christian G. Noel Reinforcement Learning with Function Approximation Joseph Christian G. Noel November 2011 Abstract Reinforcement learning (RL) is a key problem in the field of Artificial Intelligence. The main goal is

More information

Reinforcement Learning. Introduction

Reinforcement Learning. Introduction Reinforcement Learning Introduction Reinforcement Learning Agent interacts and learns from a stochastic environment Science of sequential decision making Many faces of reinforcement learning Optimal control

More information

Central-limit approach to risk-aware Markov decision processes

Central-limit approach to risk-aware Markov decision processes Central-limit approach to risk-aware Markov decision processes Jia Yuan Yu Concordia University November 27, 2015 Joint work with Pengqian Yu and Huan Xu. Inventory Management 1 1 Look at current inventory

More information

Kernel Methods. Foundations of Data Analysis. Torsten Möller. Möller/Mori 1

Kernel Methods. Foundations of Data Analysis. Torsten Möller. Möller/Mori 1 Kernel Methods Foundations of Data Analysis Torsten Möller Möller/Mori 1 Reading Chapter 6 of Pattern Recognition and Machine Learning by Bishop Chapter 12 of The Elements of Statistical Learning by Hastie,

More information

Real Time Value Iteration and the State-Action Value Function

Real Time Value Iteration and the State-Action Value Function MS&E338 Reinforcement Learning Lecture 3-4/9/18 Real Time Value Iteration and the State-Action Value Function Lecturer: Ben Van Roy Scribe: Apoorva Sharma and Tong Mu 1 Review Last time we left off discussing

More information

Jun Ma and Warren B. Powell Department of Operations Research and Financial Engineering Princeton University, Princeton, NJ 08544

Jun Ma and Warren B. Powell Department of Operations Research and Financial Engineering Princeton University, Princeton, NJ 08544 Convergence Analysis of Kernel-based On-policy Approximate Policy Iteration Algorithms for Markov Decision Processes with Continuous, Multidimensional States and Actions Jun Ma and Warren B. Powell Department

More information

Lecture 3: Markov Decision Processes

Lecture 3: Markov Decision Processes Lecture 3: Markov Decision Processes Joseph Modayil 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs Markov Processes Introduction Introduction to MDPs Markov

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

Abstract Dynamic Programming

Abstract Dynamic Programming Abstract Dynamic Programming Dimitri P. Bertsekas Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology Overview of the Research Monograph Abstract Dynamic Programming"

More information

Approximate dynamic programming for stochastic reachability

Approximate dynamic programming for stochastic reachability Approximate dynamic programming for stochastic reachability Nikolaos Kariotoglou, Sean Summers, Tyler Summers, Maryam Kamgarpour and John Lygeros Abstract In this work we illustrate how approximate dynamic

More information

Notes on Tabular Methods

Notes on Tabular Methods Notes on Tabular ethods Nan Jiang September 28, 208 Overview of the methods. Tabular certainty-equivalence Certainty-equivalence is a model-based RL algorithm, that is, it first estimates an DP model from

More information

Convergence of Synchronous Reinforcement Learning. with linear function approximation

Convergence of Synchronous Reinforcement Learning. with linear function approximation Convergence of Synchronous Reinforcement Learning with Linear Function Approximation Artur Merke artur.merke@udo.edu Lehrstuhl Informatik, University of Dortmund, 44227 Dortmund, Germany Ralf Schoknecht

More information

Value and Policy Iteration

Value and Policy Iteration Chapter 7 Value and Policy Iteration 1 For infinite horizon problems, we need to replace our basic computational tool, the DP algorithm, which we used to compute the optimal cost and policy for finite

More information

Least-Squares Methods for Policy Iteration

Least-Squares Methods for Policy Iteration Least-Squares Methods for Policy Iteration Lucian Buşoniu, Alessandro Lazaric, Mohammad Ghavamzadeh, Rémi Munos, Robert Babuška, and Bart De Schutter Abstract Approximate reinforcement learning deals with

More information

Reinforcement Learning and NLP

Reinforcement Learning and NLP 1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value

More information

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the

More information

ECE276B: Planning & Learning in Robotics Lecture 16: Model-free Control

ECE276B: Planning & Learning in Robotics Lecture 16: Model-free Control ECE276B: Planning & Learning in Robotics Lecture 16: Model-free Control Lecturer: Nikolay Atanasov: natanasov@ucsd.edu Teaching Assistants: Tianyu Wang: tiw161@eng.ucsd.edu Yongxi Lu: yol070@eng.ucsd.edu

More information

Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path

Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path Machine Learning manuscript No. will be inserted by the editor Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path András Antos 1, Csaba

More information

Fitted Q-iteration in continuous action-space MDPs

Fitted Q-iteration in continuous action-space MDPs Fitted Q-iteration in continuous action-space MDPs András Antos Computer and Automation Research Inst of the Hungarian Academy of Sciences Kende u 13-17, Budapest 1111, Hungary antos@sztakihu Rémi Munos

More information

Introduction to Reinforcement Learning Part 1: Markov Decision Processes

Introduction to Reinforcement Learning Part 1: Markov Decision Processes Introduction to Reinforcement Learning Part 1: Markov Decision Processes Rowan McAllister Reinforcement Learning Reading Group 8 April 2015 Note I ve created these slides whilst following Algorithms for

More information

Markov Decision Processes and Dynamic Programming

Markov Decision Processes and Dynamic Programming Master MVA: Reinforcement Learning Lecture: 2 Markov Decision Processes and Dnamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

Multi-armed bandit models: a tutorial

Multi-armed bandit models: a tutorial Multi-armed bandit models: a tutorial CERMICS seminar, March 30th, 2016 Multi-Armed Bandit model: general setting K arms: for a {1,..., K}, (X a,t ) t N is a stochastic process. (unknown distributions)

More information

PART A and ONE question from PART B; or ONE question from PART A and TWO questions from PART B.

PART A and ONE question from PART B; or ONE question from PART A and TWO questions from PART B. Advanced Topics in Machine Learning, GI13, 2010/11 Advanced Topics in Machine Learning, GI13, 2010/11 Answer any THREE questions. Each question is worth 20 marks. Use separate answer books Answer any THREE

More information

MS&E338 Reinforcement Learning Lecture 1 - April 2, Introduction

MS&E338 Reinforcement Learning Lecture 1 - April 2, Introduction MS&E338 Reinforcement Learning Lecture 1 - April 2, 2018 Introduction Lecturer: Ben Van Roy Scribe: Gabriel Maher 1 Reinforcement Learning Introduction In reinforcement learning (RL) we consider an agent

More information

Generalization theory

Generalization theory Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1

More information

Least-Squares Temporal Difference Learning based on Extreme Learning Machine

Least-Squares Temporal Difference Learning based on Extreme Learning Machine Least-Squares Temporal Difference Learning based on Extreme Learning Machine Pablo Escandell-Montero, José M. Martínez-Martínez, José D. Martín-Guerrero, Emilio Soria-Olivas, Juan Gómez-Sanchis IDAL, Intelligent

More information

Approximate Dynamic Programming By Minimizing Distributionally Robust Bounds

Approximate Dynamic Programming By Minimizing Distributionally Robust Bounds Approximate Dynamic Programming By Minimizing Distributionally Robust Bounds Marek Petrik IBM T.J. Watson Research Center, Yorktown, NY, USA Abstract Approximate dynamic programming is a popular method

More information

The convergence limit of the temporal difference learning

The convergence limit of the temporal difference learning The convergence limit of the temporal difference learning Ryosuke Nomura the University of Tokyo September 3, 2013 1 Outline Reinforcement Learning Convergence limit Construction of the feature vector

More information

Reinforcement Learning and Optimal Control. ASU, CSE 691, Winter 2019

Reinforcement Learning and Optimal Control. ASU, CSE 691, Winter 2019 Reinforcement Learning and Optimal Control ASU, CSE 691, Winter 2019 Dimitri P. Bertsekas dimitrib@mit.edu Lecture 8 Bertsekas Reinforcement Learning 1 / 21 Outline 1 Review of Infinite Horizon Problems

More information

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods

More information

Review: Support vector machines. Machine learning techniques and image analysis

Review: Support vector machines. Machine learning techniques and image analysis Review: Support vector machines Review: Support vector machines Margin optimization min (w,w 0 ) 1 2 w 2 subject to y i (w 0 + w T x i ) 1 0, i = 1,..., n. Review: Support vector machines Margin optimization

More information

Lecture 7: Kernels for Classification and Regression

Lecture 7: Kernels for Classification and Regression Lecture 7: Kernels for Classification and Regression CS 194-10, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 15, 2011 Outline Outline A linear regression problem Linear auto-regressive

More information

Reinforcement Learning: An Introduction

Reinforcement Learning: An Introduction Introduction Betreuer: Freek Stulp Hauptseminar Intelligente Autonome Systeme (WiSe 04/05) Forschungs- und Lehreinheit Informatik IX Technische Universität München November 24, 2004 Introduction What is

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

CMU Lecture 11: Markov Decision Processes II. Teacher: Gianni A. Di Caro

CMU Lecture 11: Markov Decision Processes II. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 11: Markov Decision Processes II Teacher: Gianni A. Di Caro RECAP: DEFINING MDPS Markov decision processes: o Set of states S o Start state s 0 o Set of actions A o Transitions P(s s,a)

More information

1 Stochastic Dynamic Programming

1 Stochastic Dynamic Programming 1 Stochastic Dynamic Programming Formally, a stochastic dynamic program has the same components as a deterministic one; the only modification is to the state transition equation. When events in the future

More information

Using first-order logic, formalize the following knowledge:

Using first-order logic, formalize the following knowledge: Probabilistic Artificial Intelligence Final Exam Feb 2, 2016 Time limit: 120 minutes Number of pages: 19 Total points: 100 You can use the back of the pages if you run out of space. Collaboration on the

More information

Some AI Planning Problems

Some AI Planning Problems Course Logistics CS533: Intelligent Agents and Decision Making M, W, F: 1:00 1:50 Instructor: Alan Fern (KEC2071) Office hours: by appointment (see me after class or send email) Emailing me: include CS533

More information

Geometric Variance Reduction in MarkovChains. Application tovalue Function and Gradient Estimation

Geometric Variance Reduction in MarkovChains. Application tovalue Function and Gradient Estimation Geometric Variance Reduction in arkovchains. Application tovalue Function and Gradient Estimation Rémi unos Centre de athématiques Appliquées, Ecole Polytechnique, 98 Palaiseau Cedex, France. remi.munos@polytechnique.fr

More information

Journal of Computational and Applied Mathematics. Projected equation methods for approximate solution of large linear systems

Journal of Computational and Applied Mathematics. Projected equation methods for approximate solution of large linear systems Journal of Computational and Applied Mathematics 227 2009) 27 50 Contents lists available at ScienceDirect Journal of Computational and Applied Mathematics journal homepage: www.elsevier.com/locate/cam

More information