Approximate Dynamic Programming
A. LAZARIC (SequeL, INRIA Lille), ENS Cachan - Master 2 MVA, MVA-RL Course, Dec 2nd

Approximate Dynamic Programming (a.k.a. Batch Reinforcement Learning)

Outline:
- Approximate Value Iteration
- Approximate Policy Iteration
From DP to ADP

Dynamic programming algorithms require an explicit definition of
- the transition probabilities p(·|x, a)
- the reward function r(x, a)

This knowledge is often unavailable (e.g., wind intensity, human-computer interaction). Can we rely on samples?
Dynamic programming algorithms also require an exact representation of value functions and policies. This is often impossible, since their shape is too complicated (e.g., over a large or continuous state space). Can we use approximations?
The Objective

Find a policy π such that the performance loss ‖V* − V^π‖_∞ is as small as possible.
From Approximation Error to Performance Loss

Question: if V is an approximation of the optimal value function V* with an error

  error = ‖V − V*‖_∞,

how does it translate to the (loss of) performance of the greedy policy

  π(x) ∈ arg max_{a ∈ A} Σ_y p(y|x, a) [ r(x, a, y) + γ V(y) ],

i.e., performance loss = ‖V* − V^π‖_∞?
From Approximation Error to Performance Loss

Proposition
Let V ∈ ℝ^N be an approximation of V* and π its corresponding greedy policy. Then

  ‖V* − V^π‖_∞ ≤ (2γ / (1 − γ)) ‖V* − V‖_∞
  (performance loss on the left, approximation error on the right).

Furthermore, there exists ɛ > 0 such that if ‖V − V*‖_∞ ≤ ɛ, then π is optimal.
From Approximation Error to Performance Loss

Proof.

  ‖V* − V^π‖_∞ ≤ ‖T V* − T^π V‖_∞ + ‖T^π V − T^π V^π‖_∞
               ≤ ‖T V* − T V‖_∞ + γ ‖V − V^π‖_∞          (π greedy w.r.t. V, so T^π V = T V)
               ≤ γ ‖V* − V‖_∞ + γ ( ‖V − V*‖_∞ + ‖V* − V^π‖_∞ ),

and solving for ‖V* − V^π‖_∞ gives ‖V* − V^π‖_∞ ≤ (2γ / (1 − γ)) ‖V* − V‖_∞. ∎
Approximate Value Iteration
From Approximation Error to Performance Loss

Question: how do we compute a good V?

Problem: unlike in standard approximation scenarios (see supervised learning), we have only limited access to the target function V*.

Solution: value iteration tends to learn functions which are close to the optimal value function V*.
Value Iteration: the Idea

1. Let Q_0 be any action-value function
2. At each iteration k = 1, 2, ..., K, compute
     Q_{k+1}(x, a) = T Q_k(x, a) = r(x, a) + γ Σ_y p(y|x, a) max_b Q_k(y, b)
3. Return the greedy policy
     π_K(x) ∈ arg max_{a ∈ A} Q_K(x, a).

Problem: how can we approximate T Q_k?
Problem: if Q_{k+1} ≠ T Q_k, does (approximate) value iteration still work?
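When p and r are known, the backup in step 2 can be run exactly. Below is a minimal tabular sketch of the loop above; the two-state MDP is an illustrative toy, not from the lecture:

```python
import numpy as np

def value_iteration(p, r, gamma, K):
    """Tabular Q-value iteration.

    p: transition tensor, p[x, a, y] = p(y | x, a)
    r: reward matrix, r[x, a]
    Returns Q_K and the greedy policy pi_K.
    """
    n_states, n_actions = r.shape
    q = np.zeros((n_states, n_actions))
    for _ in range(K):
        # (T Q)(x, a) = r(x, a) + gamma * sum_y p(y|x,a) * max_b Q(y, b)
        q = r + gamma * (p @ q.max(axis=1))
    return q, q.argmax(axis=1)

# Illustrative 2-state, 2-action MDP
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
q, pi = value_iteration(p, r, gamma=0.9, K=200)
```

After 200 backups with γ = 0.9, the residual ‖T Q − Q‖ is numerically negligible, i.e., Q is at the fixed point of T.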
Linear Fitted Q-iteration: the Approximation Space

Linear space (used to approximate action-value functions):

  F = { f_α(x, a) = Σ_{j=1}^d α_j φ_j(x, a), α ∈ ℝ^d }

with features φ_j : X × A → [0, L] and feature vector φ(x, a) = [φ_1(x, a) ... φ_d(x, a)].
Linear Fitted Q-iteration: the Samples

Assumption: access to a generative model, that is, a black-box simulator sim() of the environment. Given (x, a), sim(x, a) = {y, r}, with y ~ p(·|x, a) and r = r(x, a).
Linear Fitted Q-iteration

Input: space F, iterations K, sampling distribution ρ, number of samples n
Initial function Q_0 ∈ F
For k = 1, ..., K
1. Draw n samples (x_i, a_i) i.i.d. from ρ
2. Sample x'_i ~ p(·|x_i, a_i) and r_i = r(x_i, a_i)
3. Compute y_i = r_i + γ max_a Q_{k-1}(x'_i, a)
4. Build the training set {((x_i, a_i), y_i)}_{i=1}^n
5. Solve the least-squares problem
     f_{α̂_k} = arg min_{f_α ∈ F} (1/n) Σ_{i=1}^n ( f_α(x_i, a_i) − y_i )²
6. Set Q_k = f_{α̂_k} (truncation may be needed)
Return π_K(·) = arg max_a Q_K(·, a) (greedy policy)
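The full loop can be sketched as follows, assuming a generative model sim(x, a) and a feature map phi(x, a); the one-dimensional chain MDP and the Gaussian-bump features are illustrative stand-ins, not part of the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
n_actions = 2

def phi(x, a):
    """Illustrative features: Gaussian bumps over the state, one block per action."""
    centers = np.linspace(0.0, 1.0, 5)
    bumps = np.exp(-((x - centers) ** 2) / 0.02)
    feats = np.zeros(5 * n_actions)
    feats[a * 5:(a + 1) * 5] = bumps
    return feats

def sim(x, a):
    """Illustrative generative model on X = [0, 1]: action 0 drifts left, action 1 right."""
    drift = -0.1 if a == 0 else 0.1
    y = np.clip(x + drift + 0.05 * rng.standard_normal(), 0.0, 1.0)
    r = 1.0 if x > 0.8 else 0.0   # reward near the right edge
    return y, r

def linear_fqi(K=20, n=500, v_max=1.0 / (1 - gamma)):
    alpha = np.zeros(5 * n_actions)   # Q_0 = 0
    for _ in range(K):
        # 1-2. draw (x_i, a_i) i.i.d. from rho (uniform here) and call the simulator
        xs = rng.uniform(0.0, 1.0, size=n)
        acts = rng.integers(0, n_actions, size=n)
        ys = np.empty(n)
        Phi = np.empty((n, 5 * n_actions))
        for i, (x, a) in enumerate(zip(xs, acts)):
            x_next, r = sim(x, a)
            # 3. y_i = r_i + gamma * max_a' Q_{k-1}(x'_i, a'), with truncation
            q_next = max(np.clip(phi(x_next, b) @ alpha, -v_max, v_max)
                         for b in range(n_actions))
            ys[i] = r + gamma * q_next
            Phi[i] = phi(x, a)
        # 5. least-squares fit of the regression problem
        alpha, *_ = np.linalg.lstsq(Phi, ys, rcond=None)
    return alpha

alpha_K = linear_fqi()
policy = lambda x: max(range(n_actions), key=lambda a: phi(x, a) @ alpha_K)
```

The clipping of the backed-up values to [−V_max, V_max] mirrors the truncation of step 6.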
Linear Fitted Q-iteration: Sampling

1. Draw n samples (x_i, a_i) i.i.d. from ρ
2. Sample x'_i ~ p(·|x_i, a_i) and r_i = r(x_i, a_i)

In practice this can be done once, before running the algorithm. The sampling distribution ρ should cover the state-action space in all relevant regions. If it is not possible to choose ρ, a database of samples can be used.
Linear Fitted Q-iteration: the Training Set

3. Compute y_i = r_i + γ max_a Q_{k-1}(x'_i, a)
4. Build the training set {((x_i, a_i), y_i)}_{i=1}^n

Each y_i is an unbiased sample of T Q_{k-1}(x_i, a_i), since

  E[y_i | x_i, a_i] = E[ r_i + γ max_a Q_{k-1}(x'_i, a) ]
                    = r(x_i, a_i) + γ E[ max_a Q_{k-1}(x'_i, a) ]
                    = r(x_i, a_i) + γ ∫_X max_a Q_{k-1}(y, a) p(dy | x_i, a_i)
                    = T Q_{k-1}(x_i, a_i).

The problem thus reduces to standard regression. The training set must be recomputed at each iteration.
Linear Fitted Q-iteration: the Regression Problem

5. Solve the least-squares problem
     f_{α̂_k} = arg min_{f_α ∈ F} (1/n) Σ_{i=1}^n ( f_α(x_i, a_i) − y_i )²
6. Set Q_k = f_{α̂_k} (truncation may be needed)

Thanks to the linear space we can solve it in closed form:
- Build the feature matrix Φ = [ φ(x_1, a_1) ... φ(x_n, a_n) ]ᵀ
- Compute α̂_k = (ΦᵀΦ)⁻¹ Φᵀ y (least-squares solution)
- Truncate the result to [−V_max, V_max] (with V_max = R_max / (1 − γ))
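The closed-form step can be written directly from the normal equations; the small ridge term below is our addition for numerical stability when ΦᵀΦ is ill-conditioned, not part of the slides:

```python
import numpy as np

def fit_and_truncate(Phi, y, v_max, reg=1e-8):
    """alpha = (Phi^T Phi)^{-1} Phi^T y, then truncate predictions to [-v_max, v_max]."""
    d = Phi.shape[1]
    alpha = np.linalg.solve(Phi.T @ Phi + reg * np.eye(d), Phi.T @ y)
    predict = lambda feats: np.clip(feats @ alpha, -v_max, v_max)
    return alpha, predict

# Illustrative regression data with a known coefficient vector
rng = np.random.default_rng(1)
Phi = rng.standard_normal((200, 4))
y = Phi @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.01 * rng.standard_normal(200)
alpha, predict = fit_and_truncate(Phi, y, v_max=10.0)
```

With the tiny ridge term the solution is numerically indistinguishable from plain least squares whenever ΦᵀΦ is well-conditioned.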
Sketch of the Analysis

[Diagram: starting from Q_0, each application of T is followed by a projection, so Q_k approximates T Q_{k-1} up to an error ɛ_k; the errors ɛ_1, ..., ɛ_K accumulate into a final error between Q_K and Q*, which in turn determines the performance of the greedy policy π_K.]
Theoretical Objectives

Objective: derive a bound on the performance (quadratic) loss w.r.t. a testing distribution μ:

  ‖Q* − Q^{π_K}‖_μ ≤ ???

Sub-Objective 1: derive an intermediate bound on the prediction error at any iteration k w.r.t. the sampling distribution ρ:

  ‖T Q_{k-1} − Q_k‖_ρ ≤ ???

Sub-Objective 2: analyze how the error at each iteration propagates through the iterations:

  ‖Q* − Q^{π_K}‖_μ ≤ propagation( ‖T Q_{k-1} − Q_k‖_ρ )
The Sources of Error

Desired solution: Q̄_k = T Q_{k-1}

Best solution in F (w.r.t. the sampling distribution ρ):

  f_{α*_k} = arg inf_{f_α ∈ F} ‖f_α − Q̄_k‖_ρ

→ error from the approximation space F

Returned solution:

  f_{α̂_k} = arg min_{f_α ∈ F} (1/n) Σ_{i=1}^n ( f_α(x_i, a_i) − y_i )²

→ error from the (random) samples
Per-Iteration Error

Theorem
At each iteration k, Linear-FQI returns an approximation Q_k such that (Sub-Objective 1)

  ‖T Q_{k-1} − Q_k‖_ρ ≤ 4 ‖T Q_{k-1} − f_{α*_k}‖_ρ
                        + O( (V_max + L‖α*_k‖) √(log(1/δ) / n) )
                        + O( V_max √(d log(n/δ) / n) )

with probability at least 1 − δ, where f_{α*_k} is the best approximation of T Q_{k-1} in F w.r.t. ρ.

Tools: concentration-of-measure inequalities, covering numbers, linear algebra, union bounds, special tricks for linear spaces, ...

Remarks:
- The approximation-error term 4 ‖T Q_{k-1} − f_{α*_k}‖_ρ: no algorithm can do better (up to the constant 4); it depends on the space F and changes with the iteration k.
- The term O( (V_max + L‖α*_k‖) √(log(1/δ)/n) ) vanishes as O(n^{−1/2}) and depends on the features (L) and on the best solution (‖α*_k‖).
- The term O( V_max √(d log(n/δ)/n) ) vanishes as O(n^{−1/2}) and depends on the dimensionality of the space (d) and the number of samples (n).
Error Propagation

Objective: bound ‖Q* − Q^{π_K}‖_μ.

Problem 1: the testing norm μ is different from the sampling norm ρ.
Problem 2: we have bounds for Q_k, not for the performance of the corresponding policy π_k.
Problem 3: we have bounds for one single iteration.
Error Propagation

Let P^π be the transition kernel for a fixed policy π. Define the m-step (worst-case) concentration of the future state distribution

  c(m) = sup_{π_1, ..., π_m} ‖ d(μ P^{π_1} ... P^{π_m}) / dρ ‖_∞ < ∞

and the average (discounted) concentration

  C_{μ,ρ} = (1 − γ)² Σ_{m ≥ 1} m γ^{m−1} c(m) < +∞.
Error Propagation

Remark: relationship to the top-Lyapunov exponent

  L⁺ = sup_π lim sup_{m → ∞} (1/m) log⁺ ( ‖ρ P^{π_1} P^{π_2} ... P^{π_m}‖ ).

If L⁺ ≤ 0 (stable system), then c(m) has a polynomial growth rate and C_{μ,ρ} is finite.
Error Propagation

Proposition
Let ɛ_k = T Q_{k-1} − Q_k be the error at each iteration. Then after K iterations the performance loss of the greedy policy π_K satisfies

  ‖Q* − Q^{π_K}‖²_μ ≤ [ 2γ / (1 − γ)² ]² C_{μ,ρ} max_k ‖ɛ_k‖²_ρ + O( (γ^K / (1 − γ)³) V²_max ).

The Final Bound

Bringing everything together: plugging the per-iteration bound

  ‖ɛ_k‖_ρ = ‖T Q_{k-1} − Q_k‖_ρ ≤ 4 ‖T Q_{k-1} − f_{α*_k}‖_ρ + O( (V_max + L‖α*_k‖) √(log(1/δ)/n) ) + O( V_max √(d log(n/δ)/n) )

into the propagation result yields the final bound.
The Final Bound

Theorem (see e.g., Munos, 2003)
LinearFQI with a space F of d features and n samples at each iteration returns a policy π_K after K iterations such that

  ‖Q* − Q^{π_K}‖_μ ≤ (2γ / (1 − γ)²) √C_{μ,ρ} [ 4 d(F, T F) + O( V_max (1 + L/√ω) √(d log(n/δ) / n) ) ] + O( (γ^K / (1 − γ)³) V²_max ),

where d(F, T F) is the inherent Bellman error of F and ω is the smallest eigenvalue of the Gram matrix of the features under ρ.

The inherent Bellman error: since Q_{k-1} = f_{α̂_{k-1}} ∈ F,

  ‖T Q_{k-1} − f_{α*_k}‖_ρ = inf_{f ∈ F} ‖T Q_{k-1} − f‖_ρ = inf_{f ∈ F} ‖T f_{α̂_{k-1}} − f‖_ρ ≤ sup_{g ∈ F} inf_{f ∈ F} ‖T g − f‖_ρ = d(F, T F).

Question: how should we design F to make it compatible with the Bellman operator T?

Remarks:
- The propagation (and the change of norms from ρ to μ) makes the problem more complex than regression: how do we choose the sampling distribution ρ?
- The approximation error d(F, T F) is worse than in regression.
- The dependency on γ is worse than in the per-iteration bound: is it possible to avoid it?
- The error decreases exponentially in K, so a number of iterations K of order log(1/ɛ)/(1 − γ) suffices to make the last term of order ɛ.
- ω, the smallest eigenvalue of the Gram matrix: design the features so as to be orthogonal w.r.t. ρ.
- The asymptotic rate O(d/n) is the same as for regression.
Summary

[Diagram: the performance bound combines]
- the approximation algorithm: samples (sampling distribution ρ, number n) and the approximation space (size d, features ω, distance d(F, T F))
- the dynamic programming algorithm: propagation of the per-iteration errors ‖T Q_{k-1} − Q_k‖
- the Markov decision process: concentrability C_{μ,ρ} and range V_max
Other Implementations

Replace the regression step with:
- K-nearest neighbour
- regularized linear regression with L1 or L2 regularisation
- neural networks
- support vector regression
- ...
Example: the Optimal Replacement Problem

State: level of wear of an object (e.g., a car).
Action: {(R)eplace, (K)eep}.
Cost: c(x, R) = C; c(x, K) = c(x), maintenance plus extra costs.
Dynamics: p(·|x, R) ~ exp(β) with density d(y) = β exp(−βy) 1{y ≥ 0}; p(·|x, K) ~ x + exp(β) with density d(y − x).
Problem: minimize the discounted expected cost over an infinite horizon.
Example: the Optimal Replacement Problem

Optimal value function:

  V*(x) = min{ c(x) + γ ∫₀^∞ d(y − x) V*(y) dy,  C + γ ∫₀^∞ d(y) V*(y) dy }

Optimal policy: the action that attains the minimum.

[Figure: the management cost c(x) and the optimal value function V*(x) as functions of the wear x; the optimal policy partitions the wear axis into (K)eep and (R)eplace regions.]

Linear approximation space:

  F := { V_α(x) = Σ_{k=1}^{20} α_k cos(kπ x / x_max) }.
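The cosine approximation space and one projection step of fitted value iteration can be set up concretely as follows; the target function below is an arbitrary stand-in for T V_0, chosen only to show the least-squares projection, and x_max = 10 is an illustrative choice:

```python
import numpy as np

x_max = 10.0
d = 20

def cosine_features(x):
    """phi_k(x) = cos(k * pi * x / x_max), for k = 1, ..., d."""
    ks = np.arange(1, d + 1)
    return np.cos(np.outer(np.asarray(x), ks) * np.pi / x_max)

# Fit the best approximation in F of a target function on a uniform grid,
# as in one step of fitted value iteration.
xs = np.linspace(0.0, x_max, 200)
target = np.minimum(4.0 + 0.6 * xs, 30.0)   # illustrative stand-in for T V_0
Phi = cosine_features(xs)                   # 200 x 20 design matrix
alpha, *_ = np.linalg.lstsq(Phi, target, rcond=None)
v1 = Phi @ alpha                            # the projection of the target onto F
```

Here v1 plays the role of V_1, the member of F closest (in least squares on the grid) to the backed-up target.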
Example: the Optimal Replacement Problem

Collect N samples on a uniform grid.

Figure: left, the target values computed as {T V_0(x_n)}_{1 ≤ n ≤ N}; right, the approximation V_1 ∈ F of the target function T V_0.

Figure: left, the target values computed as {T V_1(x_n)}_{1 ≤ n ≤ N}; center, the approximation V_2 ∈ F of T V_1; right, the approximation V_n ∈ F after n iterations.
Example: the Optimal Replacement Problem — Simulation.
Approximate Policy Iteration
Policy Iteration: the Idea

1. Let π_0 be any stationary policy
2. At each iteration k = 1, 2, ..., K:
   - Policy evaluation: given π_k, compute V_k = V^{π_k}.
   - Policy improvement: compute the greedy policy
       π_{k+1}(x) ∈ arg max_{a ∈ A} [ r(x, a) + γ Σ_y p(y|x, a) V^{π_k}(y) ].
3. Return the last policy π_K.

Problem: how can we approximate V^{π_k}?
Problem: if V_k ≠ V^{π_k}, does (approximate) policy iteration still work?
Approximate Policy Iteration: Performance Loss

Problem: the algorithm is no longer guaranteed to converge.

[Figure: ‖V* − V^{π_k}‖ as a function of k oscillates within an asymptotic error band instead of converging.]

Proposition
The asymptotic performance of the policies π_k generated by the API algorithm is related to the approximation error as:

  lim sup_{k→∞} ‖V* − V^{π_k}‖_∞ ≤ (2γ / (1 − γ)²) lim sup_{k→∞} ‖V_k − V^{π_k}‖_∞
  (performance loss on the left, approximation error on the right).
Least-Squares Policy Iteration (LSPI)

LSPI uses:
- a linear space to approximate value functions*
    F = { f_α(x) = Σ_{j=1}^d α_j φ_j(x), α ∈ ℝ^d }
- the Least-Squares Temporal Difference (LSTD) algorithm for policy evaluation.

*In practice we use approximations of action-value functions.
Least-Squares Temporal-Difference Learning (LSTD)

V^π may not belong to F. The best approximation of V^π in F is its projection

  ΠV^π = arg min_{f ∈ F} ‖V^π − f‖

(Π is the projection onto F).
Least-Squares Temporal-Difference Learning (LSTD)

V^π is the fixed point of T^π:

  V^π = T^π V^π = r^π + γ P^π V^π.

LSTD searches instead for the fixed point of Π_{2,ρ} T^π, where Π_{2,ρ} g = arg min_{f ∈ F} ‖g − f‖_{2,ρ}. When the fixed point of Π_ρ T^π exists, we call it the LSTD solution:

  V_TD = Π_ρ T^π V_TD.

[Figure: T^π maps V_TD out of F and the projection Π_ρ brings it back; V_TD is the point of F left unchanged by the composition Π_ρ T^π.]
Least-Squares Temporal-Difference Learning (LSTD)

  V_TD = Π_ρ T^π V_TD

The projection Π_ρ is orthogonal in expectation w.r.t. the space F spanned by the features φ_1, ..., φ_d:

  E_{x~ρ}[ (T^π V_TD(x) − V_TD(x)) φ_i(x) ] = 0,  ∀ i ∈ [1, d]
  ⟺ ⟨T^π V_TD − V_TD, φ_i⟩_ρ = 0.

By definition of the Bellman operator:

  ⟨r^π + γ P^π V_TD − V_TD, φ_i⟩_ρ = 0
  ⟺ ⟨r^π, φ_i⟩_ρ − ⟨(I − γ P^π) V_TD, φ_i⟩_ρ = 0.

Since V_TD ∈ F, there exists α_TD such that V_TD(x) = φ(x)ᵀ α_TD, and therefore

  ⟨r^π, φ_i⟩_ρ − Σ_{j=1}^d ⟨(I − γ P^π) φ_j, φ_i⟩_ρ α_{TD,j} = 0.
Least-Squares Temporal-Difference Learning (LSTD)

Writing b_i = ⟨r^π, φ_i⟩_ρ and A_{i,j} = ⟨(I − γ P^π) φ_j, φ_i⟩_ρ, the fixed-point condition V_TD = Π_ρ T^π V_TD becomes the d × d linear system

  A α_TD = b.
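The linear system A α_TD = b can be estimated from a single trajectory by replacing each inner product under ρ with an empirical average. A sketch, using an illustrative random-walk chain with tabular (one-hot) features; the chain and features are our assumptions, not from the slides:

```python
import numpy as np

def lstd(states, rewards, phi, gamma, reg=1e-8):
    """Estimate A and b from one trajectory and solve A alpha = b.

    A_ij ~ (1/n) sum_t phi_i(x_t) (phi_j(x_t) - gamma * phi_j(x_{t+1}))
    b_i  ~ (1/n) sum_t phi_i(x_t) r_t
    """
    feats = np.array([phi(x) for x in states])   # (n+1) x d feature matrix
    cur, nxt = feats[:-1], feats[1:]
    n = len(rewards)
    A = cur.T @ (cur - gamma * nxt) / n
    b = cur.T @ np.asarray(rewards) / n
    d = A.shape[0]
    return np.linalg.solve(A + reg * np.eye(d), b)  # tiny ridge for stability

# Illustrative chain: random walk on {0, ..., 4}, reward 1 when in state 4
rng = np.random.default_rng(2)
states = [2]
rewards = []
for _ in range(5000):
    x = states[-1]
    rewards.append(1.0 if x == 4 else 0.0)
    states.append(min(max(x + rng.choice([-1, 1]), 0), 4))

phi = lambda x: np.eye(5)[x]                     # tabular features
alpha = lstd(states, rewards, phi, gamma=0.9)
```

With tabular features the LSTD solution is (up to sampling noise) the value function of the random-walk policy itself, highest in the rewarding state 4.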
Least-Squares Temporal-Difference Learning (LSTD)

Problem: in general, Π_ρ T^π is not a contraction and does not have a fixed point.
Solution: if ρ = ρ^π (the stationary distribution of π), then Π_{ρ^π} T^π has a unique fixed point.

Problem: in general, Π_ρ T^π cannot be computed (because P^π and r^π are unknown).
Solution: use samples coming from a trajectory of π.
Least-Squares Policy Iteration (LSPI)

Input: space F, iterations K, sampling distribution ρ, number of samples n
Initial policy π_0
For k = 1, ..., K
1. Generate a trajectory of length n from the stationary distribution ρ^{π_k}:
     (x_1, π_k(x_1), r_1, x_2, π_k(x_2), r_2, ..., x_{n-1}, π_k(x_{n-1}), r_{n-1}, x_n)
2. Compute the empirical matrix Â_k and the empirical vector b̂_k:
     [Â_k]_{i,j} = (1/n) Σ_t ( φ_j(x_t) − γ φ_j(x_{t+1}) ) φ_i(x_t)  ≈ ⟨(I − γ P^π) φ_j, φ_i⟩_{ρ^{π_k}}
     [b̂_k]_i = (1/n) Σ_t φ_i(x_t) r_t  ≈ ⟨r^π, φ_i⟩_{ρ^{π_k}}
3. Solve the linear system α_k = Â_k⁻¹ b̂_k
4. Compute the greedy policy π_{k+1} w.r.t. V_k = f_{α_k}
Return the last policy π_K
Least-Squares Policy Iteration (LSPI)

1. Generate a trajectory of length n from the stationary distribution ρ^{π_k}:
     (x_1, π_k(x_1), r_1, x_2, π_k(x_2), r_2, ..., x_{n-1}, π_k(x_{n-1}), r_{n-1}, x_n)

- The first few samples may be discarded, because they are not actually drawn from the stationary distribution ρ^{π_k}.
- Off-policy samples could be used with importance weighting.
- In practice, i.i.d. states drawn from an arbitrary distribution (but with actions chosen by π_k) may be used.
110 Least-Squares Policy Iteration (LSPI)
4. Compute the greedy policy $\pi_{k+1}$ w.r.t. $\tilde{V}_k = f_{\alpha_k}$
Computing the greedy policy from $\tilde{V}_k$ is difficult, so move to LSTD-Q and compute
$$\pi_{k+1}(x) = \arg\max_{a \in A} \hat{Q}_k(x, a)$$
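With LSTD-Q the learned object is $\hat{Q}_k(x,a) = \alpha_k^\top \varphi(x,a)$, so the greedy step is a pointwise argmax over actions and needs no model. A minimal sketch, with an assumed state-action feature map (not from the slides):

```python
import numpy as np

# Greedy policy improvement from a linear Q-function Q_hat(x, a) = alpha @ phi(x, a).
def greedy_policy(alpha, phi_sa, actions):
    def pi(x):
        # argmax over the (finite) action set, no transition model required
        return max(actions, key=lambda a: alpha @ phi_sa(x, a))
    return pi

phi_sa = lambda x, a: np.array([1.0, x, a, x * a])  # assumed features
alpha = np.array([0.0, 1.0, -1.0, 2.0])
pi = greedy_policy(alpha, phi_sa, actions=[0, 1])
# Here Q(x,0) = x and Q(x,1) = 3x - 1, so the greedy action switches at x = 0.5
```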
111 Least-Squares Policy Iteration (LSPI)
For $k = 1, \dots, K$
1. Generate a trajectory of length $n$ from the stationary dist. $\rho^{\pi_k}$:
$(x_1, \pi_k(x_1), r_1, x_2, \pi_k(x_2), r_2, \dots, x_{n-1}, \pi_k(x_{n-1}), r_{n-1}, x_n)$
...
4. Compute the greedy policy $\pi_{k+1}$ w.r.t. $\tilde{V}_k = f_{\alpha_k}$
Problem: this process may be unstable because $\pi_k$ does not cover the state space properly
113 LSTD Algorithm
When $n \to \infty$, then $\hat{A} \to A$ and $\hat{b} \to b$, and thus $\hat{\alpha}_{TD} \to \alpha_{TD}$ and $\hat{V}_{TD} \to V_{TD}$.
Proposition (LSTD Performance)
If LSTD is used to estimate the value of $\pi$ with an infinite number of samples drawn from the stationary distribution $\rho^\pi$, then
$$\|V^\pi - V_{TD}\|_{\rho^\pi} \le \frac{1}{\sqrt{1-\gamma^2}} \inf_{V \in \mathcal{F}} \|V^\pi - V\|_{\rho^\pi}$$
Problem: we don't have an infinite number of samples...
Problem 2: $V_{TD}$ is a fixed-point solution and not the solution of a standard machine learning problem...
116 LSTD Error Bound
Assumption: The Markov chain induced by the policy $\pi_k$ has a stationary distribution $\rho^{\pi_k}$ and it is ergodic and $\beta$-mixing.
Theorem (LSTD Error Bound)
At any iteration $k$, if LSTD uses $n$ samples obtained from a single trajectory of $\pi_k$ and a $d$-dimensional space, then with probability $1-\delta$,
$$\|V^{\pi_k} - \tilde{V}_k\|_{\rho^{\pi_k}} \le \frac{c}{\sqrt{1-\gamma^2}} \inf_{f \in \mathcal{F}} \|V^{\pi_k} - f\|_{\rho^{\pi_k}} + O\left(\sqrt{\frac{d \log(d/\delta)}{n\,\nu}}\right)$$
118 LSTD Error Bound
$$\|V^{\pi} - \tilde{V}\|_{\rho^{\pi}} \le \underbrace{\frac{c}{\sqrt{1-\gamma^2}} \inf_{f \in \mathcal{F}} \|V^{\pi} - f\|_{\rho^{\pi}}}_{\text{approximation error}} + \underbrace{O\left(\sqrt{\frac{d \log(d/\delta)}{n\,\nu}}\right)}_{\text{estimation error}}$$
Approximation error: it depends on how well the function space $\mathcal{F}$ can approximate the value function $V^\pi$
Estimation error: it depends on the number of samples $n$, the dimension of the function space $d$, the smallest eigenvalue of the Gram matrix $\nu$, and the mixing properties of the Markov chain (hidden in the $O(\cdot)$)
119 LSTD Error Bound
$$\|V^{\pi_k} - \tilde{V}_k\|_{\rho^{\pi_k}} \le \underbrace{\frac{c}{\sqrt{1-\gamma^2}} \inf_{f \in \mathcal{F}} \|V^{\pi_k} - f\|_{\rho^{\pi_k}}}_{\text{approximation error}} + \underbrace{O\left(\sqrt{\frac{d \log(d/\delta)}{n\,\nu_k}}\right)}_{\text{estimation error}}$$
$n$ = number of samples and $d$ = dimensionality
$\nu_k$ = the smallest eigenvalue of the Gram matrix $\big(\int \varphi_i \varphi_j \, d\rho^{\pi_k}\big)_{i,j}$
(Assumption: the eigenvalues of the Gram matrix are strictly positive - existence of the model-based LSTD solution)
$\beta$-mixing coefficients are hidden in the $O(\cdot)$ notation
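The role of $\nu$ can be seen empirically: the estimation error scales like $1/\sqrt{\nu}$, so nearly collinear features drive the smallest eigenvalue of the Gram matrix toward zero and blow up the bound. A small numerical sketch, with assumed feature choices:

```python
import numpy as np

# Smallest eigenvalue of the empirical Gram matrix G_{ij} = (1/n) sum_t phi_i(x_t) phi_j(x_t),
# a sample-based stand-in for nu = lambda_min( int phi_i phi_j d rho ).
def smallest_gram_eig(Phi):
    G = Phi.T @ Phi / Phi.shape[0]
    return np.linalg.eigvalsh(G)[0]   # eigvalsh returns eigenvalues in ascending order

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=2000)
# well-conditioned features: constant and x
well_cond = np.stack([np.ones_like(x), x], axis=1)
# nearly collinear features: x and a tiny perturbation of x
collinear = np.stack([x, x + 1e-3 * rng.standard_normal(2000)], axis=1)
nu_good = smallest_gram_eig(well_cond)
nu_bad = smallest_gram_eig(collinear)
```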
121 LSPI Error Bound
Theorem (LSPI Error Bound)
If LSPI is run over $K$ iterations, then with probability $1-\delta$ the performance loss of the policy $\pi_K$ is
$$\|V^* - V^{\pi_K}\|_\mu \le \frac{4\gamma}{(1-\gamma)^2} \left\{ \sqrt{C C_{\mu,\rho}} \left[ c E_0(\mathcal{F}) + O\left(\sqrt{\frac{d \log(dK/\delta)}{n\,\nu_\rho}}\right) \right] + \gamma^K R_{\max} \right\}$$
Approximation error: $E_0(\mathcal{F}) = \sup_{\pi \in \mathcal{G}(\tilde{\mathcal{F}})} \inf_{f \in \mathcal{F}} \|V^\pi - f\|_{\rho^\pi}$
Estimation error: depends on $n$, $d$, $\nu_\rho$, $K$
Initialization error: error due to the choice of the initial value function or initial policy $\|V^* - V^{\pi_0}\|$
125 LSPI Error Bound
$$\|V^* - V^{\pi_K}\|_\mu \le \frac{4\gamma}{(1-\gamma)^2} \left\{ \sqrt{C C_{\mu,\rho}} \left[ c E_0(\mathcal{F}) + O\left(\sqrt{\frac{d \log(dK/\delta)}{n\,\nu_\rho}}\right) \right] + \gamma^K R_{\max} \right\}$$
Lower-Bounding Distribution
There exists a distribution $\rho$ such that for any policy $\pi \in \mathcal{G}(\tilde{\mathcal{F}})$, we have $\rho \le C \rho^\pi$, where $C < \infty$ is a constant and $\rho^\pi$ is the stationary distribution of $\pi$. Furthermore, we can define the concentrability coefficient $C_{\mu,\rho}$ as before.
$\nu_\rho$ = the smallest eigenvalue of the Gram matrix $\big(\int \varphi_i \varphi_j \, d\rho\big)_{i,j}$
127 Bellman Residual Minimization (BRM): the idea
[diagram: the space $\mathcal{F}$, the target $V^\pi$, and the Bellman operator $T^\pi$; $V_{BR}$ is the point of $\mathcal{F}$ with the smallest Bellman residual $\|T^\pi V - V\|$]
Let $\mu$ be a distribution over $X$; $V_{BR}$ is the minimizer of the Bellman residual w.r.t. $T^\pi$:
$$V_{BR} = \arg\min_{V \in \mathcal{F}} \|T^\pi V - V\|_{2,\mu}$$
128 Bellman Residual Minimization (BRM): the idea
The mapping $\alpha \mapsto T^\pi V_\alpha - V_\alpha$ is affine
The function $\alpha \mapsto \|T^\pi V_\alpha - V_\alpha\|_\mu^2$ is quadratic
The minimum is obtained by computing the gradient and setting it to zero:
$$\Big\langle r^\pi + (\gamma P^\pi - I) \sum_{j=1}^{d} \varphi_j \alpha_j,\; (\gamma P^\pi - I)\varphi_i \Big\rangle_\mu = 0,$$
which can be rewritten as $A\alpha = b$, with
$$A_{i,j} = \langle \varphi_i - \gamma P^\pi \varphi_i,\; \varphi_j - \gamma P^\pi \varphi_j \rangle_\mu, \qquad b_i = \langle \varphi_i - \gamma P^\pi \varphi_i,\; r^\pi \rangle_\mu$$
129 Bellman Residual Minimization (BRM): the idea
Remark: the system admits a solution whenever the features $\varphi_i$ are linearly independent w.r.t. $\mu$
Remark: let $\psi_i = \varphi_i - \gamma P^\pi \varphi_i$, $i = 1, \dots, d$; then the previous system can be interpreted as the linear regression problem
$$\min_\alpha \|\alpha^\top \psi - r^\pi\|_\mu$$
131 BRM: the approximation error
Proposition
We have
$$\|V^\pi - V_{BR}\| \le \|(I - \gamma P^\pi)^{-1}\| \, (1 + \gamma \|P^\pi\|) \inf_{V \in \mathcal{F}} \|V^\pi - V\|.$$
If $\mu_\pi$ is the stationary distribution of $\pi$, then $\|P^\pi\|_{\mu_\pi} = 1$ and $\|(I - \gamma P^\pi)^{-1}\|_{\mu_\pi} \le \frac{1}{1-\gamma}$, thus
$$\|V^\pi - V_{BR}\|_{\mu_\pi} \le \frac{1+\gamma}{1-\gamma} \inf_{V \in \mathcal{F}} \|V^\pi - V\|_{\mu_\pi}.$$
132 BRM: the implementation
Assumption: a generative model is available.
Draw $n$ states $X_t \sim \mu$
Call the generative model on $(X_t, A_t)$ (with $A_t = \pi(X_t)$) and obtain $R_t = r(X_t, A_t)$, $Y_t \sim p(\cdot \mid X_t, A_t)$
Compute
$$\hat{B}(V) = \frac{1}{n} \sum_{t=1}^{n} \Big[ V(X_t) - \underbrace{\big( R_t + \gamma V(Y_t) \big)}_{\hat{T} V(X_t)} \Big]^2.$$
133 BRM: the implementation
Problem: this estimator is biased and not consistent! In fact,
$$\mathbb{E}[\hat{B}(V)] = \mathbb{E}\Big[ \big[ V(X_t) - T^\pi V(X_t) + T^\pi V(X_t) - \hat{T} V(X_t) \big]^2 \Big] = \|T^\pi V - V\|_\mu^2 + \mathbb{E}\Big[ \big[ T^\pi V(X_t) - \hat{T} V(X_t) \big]^2 \Big]$$
$\Rightarrow$ minimizing $\hat{B}(V)$ does not correspond to minimizing $B(V)$ (even when $n \to \infty$).
134 BRM: the implementation
Solution: in each state $X_t$, generate two independent samples $Y_t$ and $Y'_t \sim p(\cdot \mid X_t, A_t)$
Define
$$\hat{B}(V) = \frac{1}{n} \sum_{t=1}^{n} \big[ V(X_t) - \big( R_t + \gamma V(Y_t) \big) \big] \big[ V(X_t) - \big( R_t + \gamma V(Y'_t) \big) \big].$$
Then $\hat{B} \to B$ for $n \to \infty$.
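The bias and its fix can be checked with a tiny Monte Carlo sketch on an assumed toy problem: one state with $V(X) = 0$, reward $R = 0$, a uniform next state $Y \in \{0, 1\}$ with $V(Y) = Y$, and $\gamma = 0.9$. The true squared residual is $(0 - 0.9 \cdot 0.5)^2 = 0.2025$; the one-sample estimator concentrates on $0.2025 + \mathrm{Var}(\gamma V(Y)) = 0.405$, while the two-sample estimator is unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 200_000, 0.9
Y1 = rng.integers(0, 2, n).astype(float)   # V(Y1) = Y1 in this toy setup
Y2 = rng.integers(0, 2, n).astype(float)   # independent second sample
# one-sample estimator: squares a noisy Bellman backup, picking up its variance
biased = np.mean((0.0 - gamma * Y1) ** 2)
# two-sample estimator: the product of two independent residuals is unbiased
unbiased = np.mean((0.0 - gamma * Y1) * (0.0 - gamma * Y2))
```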
135 BRM: the implementation
The function $\alpha \mapsto \hat{B}(V_\alpha)$ is quadratic and we obtain the linear system
$$\hat{A}_{i,j} = \frac{1}{n} \sum_{t=1}^{n} \big[ \varphi_i(X_t) - \gamma \varphi_i(Y_t) \big] \big[ \varphi_j(X_t) - \gamma \varphi_j(Y'_t) \big],$$
$$\hat{b}_i = \frac{1}{n} \sum_{t=1}^{n} \Big[ \varphi_i(X_t) - \gamma \frac{\varphi_i(Y_t) + \varphi_i(Y'_t)}{2} \Big] R_t.$$
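The empirical system above is a few lines of linear algebra. A minimal sketch, with a sanity check on an assumed deterministic two-state chain (0 → 1 → 0, reward 1 in state 0) where both next-state samples coincide and the features represent $V^\pi$ exactly, so BRM recovers it:

```python
import numpy as np

def brm_weights(X, Y1, Y2, R, phi, gamma):
    # Unbiased BRM linear system with two independent next-state samples per state:
    #   A_hat[i,j] = (1/n) sum_t [phi_i(X_t) - g phi_i(Y_t)] [phi_j(X_t) - g phi_j(Y'_t)]
    #   b_hat[i]   = (1/n) sum_t [phi_i(X_t) - g (phi_i(Y_t) + phi_i(Y'_t)) / 2] R_t
    P = np.array([phi(s) for s in X])
    Q1 = np.array([phi(s) for s in Y1])
    Q2 = np.array([phi(s) for s in Y2])
    n = len(X)
    A_hat = (P - gamma * Q1).T @ (P - gamma * Q2) / n
    b_hat = (P - gamma * (Q1 + Q2) / 2).T @ np.asarray(R, dtype=float) / n
    return np.linalg.solve(A_hat, b_hat)

# Deterministic chain: V(0) = 1/(1 - gamma^2), V(1) = gamma * V(0)
X = np.array([0, 1] * 20)
Y = 1 - X
R = (X == 0).astype(float)
phi = lambda s: np.array([1.0, float(s)])
alpha = brm_weights(X, Y, Y, R, phi, gamma=0.9)
```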
136 BRM: the approximation error
Proof. We relate the Bellman residual to the approximation error:
$$V^\pi - V = V^\pi - T^\pi V + T^\pi V - V = \gamma P^\pi (V^\pi - V) + T^\pi V - V$$
$$\Rightarrow (I - \gamma P^\pi)(V^\pi - V) = T^\pi V - V.$$
Taking the norm on both sides we obtain
$$\|V^\pi - V_{BR}\| \le \|(I - \gamma P^\pi)^{-1}\| \, \|T^\pi V_{BR} - V_{BR}\|,$$
and
$$\|T^\pi V_{BR} - V_{BR}\| = \inf_{V \in \mathcal{F}} \|T^\pi V - V\| \le (1 + \gamma \|P^\pi\|) \inf_{V \in \mathcal{F}} \|V^\pi - V\|.$$
137 BRM: the approximation error
Proof (cont'd). If we consider the stationary distribution $\mu_\pi$, then $\|P^\pi\|_{\mu_\pi} = 1$. The matrix $(I - \gamma P^\pi)^{-1}$ can be written as the power series $\sum_t \gamma^t (P^\pi)^t$. Applying the norm we obtain
$$\|(I - \gamma P^\pi)^{-1}\|_{\mu_\pi} \le \sum_{t \ge 0} \gamma^t \|P^\pi\|^t_{\mu_\pi} = \frac{1}{1-\gamma}.$$
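This proof step can be verified numerically on an assumed small chain: for a stochastic $P^\pi$ with stationary distribution $\mu_\pi$, the $\mu_\pi$-weighted $L_2$ operator norm satisfies $\|P^\pi\|_{\mu_\pi} = 1$ and $\|(I - \gamma P^\pi)^{-1}\|_{\mu_\pi} \le 1/(1-\gamma)$.

```python
import numpy as np

# Hypothetical 3-state chain induced by a fixed policy (rows sum to 1)
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
gamma = 0.9

# Stationary distribution: left eigenvector of P for eigenvalue 1
w, V = np.linalg.eig(P.T)
mu = np.real(V[:, np.argmin(np.abs(w - 1.0))])
mu = mu / mu.sum()

# mu-weighted operator norm: ||A||_mu = ||D^{1/2} A D^{-1/2}||_2, D = diag(mu)
D = np.diag(np.sqrt(mu))
Dinv = np.diag(1.0 / np.sqrt(mu))
mu_norm = lambda A: np.linalg.norm(D @ A @ Dinv, 2)

resolvent = np.linalg.inv(np.eye(3) - gamma * P)
```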
138 LSTD vs BRM
Different assumptions: BRM requires a generative model; LSTD requires a single trajectory.
The performance is evaluated differently: BRM under any sampling distribution $\mu$, LSTD under the stationary distribution $\mu_\pi$.
139 Bibliography I
140 Reinforcement Learning Alessandro Lazaric sequel.lille.inria.fr
Approximate Dynamic Programming By Minimizing Distributionally Robust Bounds Marek Petrik IBM T.J. Watson Research Center, Yorktown, NY, USA Abstract Approximate dynamic programming is a popular method
More informationThe convergence limit of the temporal difference learning
The convergence limit of the temporal difference learning Ryosuke Nomura the University of Tokyo September 3, 2013 1 Outline Reinforcement Learning Convergence limit Construction of the feature vector
More informationReinforcement Learning and Optimal Control. ASU, CSE 691, Winter 2019
Reinforcement Learning and Optimal Control ASU, CSE 691, Winter 2019 Dimitri P. Bertsekas dimitrib@mit.edu Lecture 8 Bertsekas Reinforcement Learning 1 / 21 Outline 1 Review of Infinite Horizon Problems
More informationLecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.
Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods
More informationReview: Support vector machines. Machine learning techniques and image analysis
Review: Support vector machines Review: Support vector machines Margin optimization min (w,w 0 ) 1 2 w 2 subject to y i (w 0 + w T x i ) 1 0, i = 1,..., n. Review: Support vector machines Margin optimization
More informationLecture 7: Kernels for Classification and Regression
Lecture 7: Kernels for Classification and Regression CS 194-10, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 15, 2011 Outline Outline A linear regression problem Linear auto-regressive
More informationReinforcement Learning: An Introduction
Introduction Betreuer: Freek Stulp Hauptseminar Intelligente Autonome Systeme (WiSe 04/05) Forschungs- und Lehreinheit Informatik IX Technische Universität München November 24, 2004 Introduction What is
More informationOptimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison
Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big
More informationCMU Lecture 11: Markov Decision Processes II. Teacher: Gianni A. Di Caro
CMU 15-781 Lecture 11: Markov Decision Processes II Teacher: Gianni A. Di Caro RECAP: DEFINING MDPS Markov decision processes: o Set of states S o Start state s 0 o Set of actions A o Transitions P(s s,a)
More information1 Stochastic Dynamic Programming
1 Stochastic Dynamic Programming Formally, a stochastic dynamic program has the same components as a deterministic one; the only modification is to the state transition equation. When events in the future
More informationUsing first-order logic, formalize the following knowledge:
Probabilistic Artificial Intelligence Final Exam Feb 2, 2016 Time limit: 120 minutes Number of pages: 19 Total points: 100 You can use the back of the pages if you run out of space. Collaboration on the
More informationSome AI Planning Problems
Course Logistics CS533: Intelligent Agents and Decision Making M, W, F: 1:00 1:50 Instructor: Alan Fern (KEC2071) Office hours: by appointment (see me after class or send email) Emailing me: include CS533
More informationGeometric Variance Reduction in MarkovChains. Application tovalue Function and Gradient Estimation
Geometric Variance Reduction in arkovchains. Application tovalue Function and Gradient Estimation Rémi unos Centre de athématiques Appliquées, Ecole Polytechnique, 98 Palaiseau Cedex, France. remi.munos@polytechnique.fr
More informationJournal of Computational and Applied Mathematics. Projected equation methods for approximate solution of large linear systems
Journal of Computational and Applied Mathematics 227 2009) 27 50 Contents lists available at ScienceDirect Journal of Computational and Applied Mathematics journal homepage: www.elsevier.com/locate/cam
More information