Approximate Dynamic Programming
A. LAZARIC (SequeL Team, INRIA Lille)
Ecole Centrale, Option DAD, EC-RL Course
(Reinforcement Learning Algorithms, Oct 29th)
Value Iteration: the Idea

1. Let V_0 be any vector in R^N.
2. At each iteration k = 1, 2, ..., K, compute V_{k+1} = T V_k.
3. Return the greedy policy

   \pi_K(x) \in \arg\max_{a \in A} \Big[ r(x, a) + \gamma \sum_y p(y|x, a) V_K(y) \Big].
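The three steps above can be sketched in a few lines of Python. The 2-state, 2-action MDP below (the transition tensor P and reward matrix r) is a made-up illustration, not an example from the lecture.

```python
import numpy as np

gamma = 0.9
# Hypothetical MDP: P[a, x, y] = p(y | x, a), r[x, a] (illustrative numbers)
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.6, 0.4]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])

def T(V):
    """Bellman optimality operator: (TV)(x) = max_a [r(x,a) + gamma sum_y p(y|x,a) V(y)]."""
    Q = r + gamma * np.einsum('axy,y->xa', P, V)
    return Q.max(axis=1)

V = np.zeros(2)               # V_0: any vector in R^N
for k in range(200):          # V_{k+1} = T V_k
    V = T(V)

# Greedy policy w.r.t. the final iterate V_K
Q = r + gamma * np.einsum('axy,y->xa', P, V)
pi_K = Q.argmax(axis=1)
```

Since T is a gamma-contraction, 200 iterations shrink the initial error by a factor gamma^200, so the final iterate is numerically indistinguishable from the fixed point V*.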
Value Iteration: the Guarantees

From the fixed point property of T:

   \lim_{k \to \infty} V_k = V^*

From the contraction property of T:

   \|V_{k+1} - V^*\|_\infty \le \gamma^{k+1} \|V_0 - V^*\|_\infty

Problem: what if V_{k+1} \neq T V_k?
Policy Iteration: the Idea

1. Let \pi_0 be any stationary policy.
2. At each iteration k = 1, 2, ..., K:
   - Policy evaluation: given \pi_k, compute V_k = V^{\pi_k}.
   - Policy improvement: compute the greedy policy

     \pi_{k+1}(x) \in \arg\max_{a \in A} \Big[ r(x, a) + \gamma \sum_y p(y|x, a) V^{\pi_k}(y) \Big].

3. Return the last policy \pi_K.
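On the same kind of small, made-up MDP as before (hypothetical numbers), the evaluation/improvement loop can be sketched as follows. Since the state space is finite, policy evaluation is done exactly by solving the linear system (I - gamma P^pi) V = r^pi.

```python
import numpy as np

gamma = 0.9
# Hypothetical 2-state, 2-action MDP: P[a, x, y] = p(y | x, a), r[x, a]
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.6, 0.4]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])
n = 2

pi = np.zeros(n, dtype=int)                     # pi_0: arbitrary stationary policy
for _ in range(20):
    # Policy evaluation: V^pi solves (I - gamma P^pi) V = r^pi
    P_pi = P[pi, np.arange(n), :]               # P_pi[x, y] = p(y | x, pi(x))
    r_pi = r[np.arange(n), pi]
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    # Policy improvement: greedy policy w.r.t. V^pi
    Q = r + gamma * np.einsum('axy,y->xa', P, V)
    pi_new = Q.argmax(axis=1)
    if np.array_equal(pi_new, pi):
        break                                   # converged in a finite number of iterations
    pi = pi_new
```

The loop stops as soon as the greedy policy no longer changes, which on a finite MDP happens after finitely many iterations.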
Policy Iteration: the Guarantees

The policy iteration algorithm generates a sequence of policies with non-decreasing performance,

   V^{\pi_{k+1}} \ge V^{\pi_k},

and it converges to \pi^* in a finite number of iterations.

Problem: what if V_k \neq V^{\pi_k}?
Sources of Error

Approximation error. If X is large or continuous, value functions V cannot be represented exactly: use an approximation space F.

Estimation error. If the reward r and dynamics p are unknown, the Bellman operators T and T^\pi cannot be computed exactly: estimate the Bellman operators from samples.
Outline

- Performance Loss
- Approximate Value Iteration
- Approximate Policy Iteration
From Approximation Error to Performance Loss

Question: if V is an approximation of the optimal value function V^* with an error

   error = \|V - V^*\|_\infty,

how does it translate to the (loss of) performance of the greedy policy

   \pi(x) \in \arg\max_{a \in A} \sum_y p(y|x, a) \big[ r(x, a, y) + \gamma V(y) \big],

i.e., what is

   performance loss = \|V^* - V^\pi\|_\infty ?
Proposition
Let V \in R^N be an approximation of V^* and \pi its corresponding greedy policy. Then

   \underbrace{\|V^* - V^\pi\|_\infty}_{\text{performance loss}} \le \frac{2\gamma}{1-\gamma} \underbrace{\|V - V^*\|_\infty}_{\text{approx. error}}.

Furthermore, there exists \epsilon > 0 such that if \|V - V^*\|_\infty \le \epsilon, then \pi is optimal.
Question: how do we compute V?

Problem: unlike in standard approximation scenarios (see supervised learning), we only have limited access to the target function V^*.

Objective: given an approximation space F, compute an approximation V which is as close as possible to the best approximation of V^* in F, i.e.

   V \approx \arg\inf_{f \in F} \|V^* - f\|.
Approximate Value Iteration
Approximate Value Iteration: the Idea

Let \mathcal{A} be an approximation operator.

1. Let V_0 be any vector in R^N.
2. At each iteration k = 1, 2, ..., K, compute V_{k+1} = \mathcal{A} T V_k.
3. Return the greedy policy

   \pi_K(x) \in \arg\max_{a \in A} \Big[ r(x, a) + \gamma \sum_y p(y|x, a) V_K(y) \Big].
Let \mathcal{A} = \Pi_\infty be a projection operator in L_\infty-norm, which corresponds to

   V_{k+1} = \Pi_\infty T V_k = \arg\inf_{V \in F} \|T V_k - V\|_\infty.
Approximate Value Iteration: convergence

Proposition
The projection \Pi_\infty is a non-expansion and the joint operator \Pi_\infty T is a contraction. Thus there exists a unique fixed point \tilde{V} = \Pi_\infty T \tilde{V}, which guarantees the convergence of AVI.
Approximate Value Iteration: performance loss

Proposition (Bertsekas & Tsitsiklis, 1996)
Let V_K be the function returned by AVI after K iterations and \pi_K its corresponding greedy policy. Then

   \|V^* - V^{\pi_K}\|_\infty \le \frac{2\gamma}{(1-\gamma)^2} \underbrace{\max_{0 \le k < K} \|T V_k - \mathcal{A} T V_k\|_\infty}_{\text{worst approx. error}} + \underbrace{\frac{2\gamma^{K+1}}{1-\gamma} \|V^* - V_0\|_\infty}_{\text{initial error}}.
Fitted Q-iteration with linear approximation

Assumption: access to a generative model. Given a state x and an action a, it returns the reward r(x, a) and a next state y ~ p(.|x, a).

Idea: work with Q-functions and linear spaces.

Q^* is the unique fixed point of T defined over X x A as:

   T Q(x, a) = \sum_y p(y|x, a) \big[ r(x, a, y) + \gamma \max_b Q(y, b) \big].

F is a space defined by d features \phi_1, ..., \phi_d : X x A \to R as:

   F = \Big\{ Q_\alpha(x, a) = \sum_{j=1}^d \alpha_j \phi_j(x, a), \ \alpha \in R^d \Big\}.

At each iteration compute Q_{k+1} = \Pi T Q_k.
Problems:
- the \Pi operator cannot be computed efficiently;
- the Bellman operator T is often unknown.
Problem: the \Pi operator cannot be computed efficiently.

Let \mu be a distribution over X. We use a projection in L_{2,\mu}-norm onto the space F:

   Q_{k+1} = \arg\min_{Q \in F} \|Q - T Q_k\|_\mu^2.
Problem: the Bellman operator T is often unknown.

1. Sample n state-action pairs (X_i, A_i) with X_i ~ \mu and A_i random.
2. Simulate Y_i ~ p(.|X_i, A_i) and R_i = r(X_i, A_i, Y_i) with the generative model.
3. Estimate T Q_k(X_i, A_i) with

   Z_i = R_i + \gamma \max_{a \in A} Q_k(Y_i, a)

   (unbiased: E[Z_i | X_i, A_i] = T Q_k(X_i, A_i)).
At each iteration k compute Q_{k+1} as

   Q_{k+1} = \arg\min_{Q_\alpha \in F} \frac{1}{n} \sum_{i=1}^n \big[ Q_\alpha(X_i, A_i) - Z_i \big]^2.

Since Q_\alpha is linear in \alpha, this is a simple quadratic minimization problem with a closed-form solution.
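The whole fitted Q-iteration loop can be sketched as below. Everything problem-specific here is an illustrative assumption: a made-up 1-d generative model (random-walk dynamics, quadratic cost) and a per-action polynomial feature map; only the structure of the algorithm (sample, bootstrap targets Z_i, least-squares fit) follows the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, n, n_iter = 0.9, 400, 20

# Hypothetical generative model: 1-d state in [0, 1], two actions, cost -x^2
def generative_model(x, a):
    drift = 0.1 if a == 1 else -0.1
    y = float(np.clip(x + drift + 0.05 * rng.standard_normal(), 0.0, 1.0))
    return -x**2, y                       # reward R = r(x, a, y), next state Y

def phi(x, a):
    # d = 6 features: a separate polynomial basis per action (an assumption)
    out = np.zeros(6)
    out[3*a:3*a+3] = [1.0, x, x**2]
    return out

alpha = np.zeros(6)                       # Q_0 = 0
X = rng.uniform(0.0, 1.0, n)              # X_i ~ mu, A_i random
A = rng.integers(0, 2, n)
for _ in range(n_iter):
    Phi = np.array([phi(x, a) for x, a in zip(X, A)])
    Z = np.empty(n)
    for i in range(n):
        R, Y = generative_model(X[i], A[i])
        # Z_i = R_i + gamma max_a Q_k(Y_i, a): unbiased estimate of T Q_k
        Z[i] = R + gamma * max(phi(Y, 0) @ alpha, phi(Y, 1) @ alpha)
    # Closed-form least-squares projection of the targets onto F
    alpha, *_ = np.linalg.lstsq(Phi, Z, rcond=None)
```

The inner fit is exactly the quadratic minimization of the slide: `lstsq` solves the normal equations of (1/n) sum_i [Q_alpha(X_i, A_i) - Z_i]^2.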
Other implementations

- K-nearest neighbour
- Regularized linear regression with L_1 or L_2 regularization
- Neural network
- Support vector machine
Example: the Optimal Replacement Problem

State: level of wear of an object (e.g., a car).
Action: {(R)eplace, (K)eep}.
Cost: c(x, R) = C; c(x, K) = c(x), the maintenance cost plus extra costs.
Dynamics: p(.|x, R) = Exp(\beta), with density d(y) = \beta e^{-\beta y} 1\{y \ge 0\}; p(.|x, K) = x + Exp(\beta), with density d(y - x).
Problem: minimize the discounted expected cost over an infinite horizon.
Optimal value function:

   V^*(x) = \min\Big\{ c(x) + \gamma \int_0^\infty d(y - x) V^*(y)\,dy, \ \ C + \gamma \int_0^\infty d(y) V^*(y)\,dy \Big\}

Optimal policy: the action that attains the minimum.

[Figure: the management cost and the value function as functions of the wear level, with the alternating (K)eep / (R)eplace regions of the optimal policy.]

Linear approximation space:

   F := \Big\{ V_\alpha(x) = \sum_{k=1}^{20} \alpha_k \cos\Big(k \pi \frac{x}{x_{max}}\Big) \Big\}.
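The cosine approximation space of the example is easy to instantiate. In the sketch below, x_max, the grid size, and the target values are placeholders (the true targets {T V_0(x_n)} depend on the replacement model's parameters, which the slides do not fully specify); only the feature map and the least-squares fit mirror the example.

```python
import numpy as np

x_max, d, N = 10.0, 20, 100               # illustrative values; the slides use 20 features

def features(x):
    k = np.arange(1, d + 1)
    return np.cos(k * np.pi * x / x_max)  # phi_k(x) = cos(k pi x / x_max)

# Uniform grid of sample states and placeholder target values standing in
# for the Bellman backups {T V_0(x_n)}_{1 <= n <= N}
xs = np.linspace(0.0, x_max, N)
targets = np.minimum(xs, 5.0)

# Least-squares fit of the targets in F, giving the next iterate V_1
Phi = np.array([features(x) for x in xs])
alpha, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
V1 = Phi @ alpha
```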
Collect N samples on a uniform grid.

Figure: Left: the target values computed as \{T V_0(x_n)\}_{1 \le n \le N}. Right: the approximation V_1 \in F of the target function T V_0.

Figure: Left: the target values computed as \{T V_1(x_n)\}_{1 \le n \le N}. Center: the approximation V_2 \in F of T V_1. Right: the approximation V_n \in F after n iterations.
Approximate Policy Iteration
Approximate Policy Iteration: the Idea

Let \mathcal{A} be an approximation operator.

- Policy evaluation: given the current policy \pi_k, compute V_k = \mathcal{A} V^{\pi_k}.
- Policy improvement: given the approximated value of the current policy, compute the greedy policy w.r.t. V_k as

   \pi_{k+1}(x) \in \arg\max_{a \in A} \Big[ r(x, a) + \gamma \sum_{y \in X} p(y|x, a) V_k(y) \Big].

Problem: the algorithm is no longer guaranteed to converge.

[Figure: \|V^* - V^{\pi_k}\| as a function of k, oscillating around an asymptotic error level instead of converging.]
Approximate Policy Iteration: performance loss

Proposition
The asymptotic performance of the policies \pi_k generated by the API algorithm is related to the approximation error as:

   \underbrace{\limsup_{k \to \infty} \|V^* - V^{\pi_k}\|_\infty}_{\text{performance loss}} \le \frac{2\gamma}{(1-\gamma)^2} \underbrace{\limsup_{k \to \infty} \|V_k - V^{\pi_k}\|_\infty}_{\text{approximation error}}
Linear Temporal-Difference
Linear TD(\lambda): the algorithm

Given a linear space F = \{V_\alpha(x) = \sum_{i=1}^d \alpha_i \phi_i(x), \ \alpha \in R^d\}, a trace vector z \in R^d and a parameter vector \alpha \in R^d, both initialized to zero. Generate a sequence of states (x_0, x_1, x_2, ...) according to \pi. At each step t, the temporal difference is

   d_t = r(x_t, \pi(x_t)) + \gamma V_{\alpha_t}(x_{t+1}) - V_{\alpha_t}(x_t)

and the parameters are updated as

   \alpha_{t+1} = \alpha_t + \eta_t d_t z_t,
   z_{t+1} = \lambda\gamma z_t + \phi(x_{t+1}),

where \eta_t is the learning step.
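A minimal sketch of this update loop, on a made-up 3-state chain: the chain P_pi, rewards r_pi, features, and step-size schedule are all illustrative assumptions, and the trace update follows the standard accumulating-trace convention (trace is refreshed with phi(x_t) before the parameter update).

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, lam = 0.9, 0.5

# Hypothetical Markov chain induced by a fixed policy pi, with rewards r^pi
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.5, 0.0, 0.5]])
r_pi = np.array([1.0, 0.0, -1.0])

def phi(x):
    return np.array([1.0, float(x)])       # d = 2 linearly independent features

alpha = np.zeros(2)
z = np.zeros(2)                            # eligibility trace
x = 0
for t in range(50_000):
    x_next = int(rng.choice(3, p=P_pi[x]))
    z = lam * gamma * z + phi(x)           # accumulate the trace on the visited state
    # temporal difference: d_t = r + gamma V_alpha(x') - V_alpha(x)
    d_t = r_pi[x] + gamma * phi(x_next) @ alpha - phi(x) @ alpha
    eta = 1.0 / (1 + t) ** 0.6             # sum eta_t = inf, sum eta_t^2 < inf
    alpha = alpha + eta * d_t * z
    x = x_next
```

The cost per step is linear in d (one feature vector, one inner product, one trace update), which is the "Pros" noted later; the price is the very slow convergence of the stochastic iterates.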
Linear TD(\lambda): approximation error

Proposition (Tsitsiklis & Van Roy, 1996)
Let the learning rate \eta_t satisfy \sum_{t \ge 0} \eta_t = \infty and \sum_{t \ge 0} \eta_t^2 < \infty. Assume that \pi admits a stationary distribution \mu^\pi and that the features (\phi_i)_{1 \le i \le d} are linearly independent. Then there exists a fixed \alpha^* such that \lim_{t \to \infty} \alpha_t = \alpha^*. Furthermore,

   \underbrace{\|V_{\alpha^*} - V^\pi\|_{2,\mu^\pi}}_{\text{approximation error}} \le \frac{1 - \lambda\gamma}{1 - \gamma} \underbrace{\inf_\alpha \|V_\alpha - V^\pi\|_{2,\mu^\pi}}_{\text{smallest approximation error}}.
Remark: for \lambda = 1, we recover Monte-Carlo (TD(1)) and the bound is the smallest!

Problem: the bound does not consider the variance (i.e., the number of samples needed for \alpha_t to converge to \alpha^*).
Linear TD(\lambda): implementation

Pros: simple to implement; computational cost linear in d.
Cons: very sample inefficient; many samples are needed to converge.
Least-Squares Temporal Difference
Least-squares TD: the algorithm

Recall: V^\pi = T^\pi V^\pi. Intuition: compute V_{TD} = \mathcal{A} T^\pi V_{TD}.

[Figure: V^\pi, the approximation space F, and the projected fixed point V_{TD} = \Pi_\mu T^\pi V_{TD}.]

Focus on the L_{2,\mu}-weighted norm and projection \Pi_\mu:

   \Pi_\mu g = \arg\min_{f \in F} \|f - g\|_\mu.
By construction, the Bellman residual of V_{TD} is orthogonal to F, thus for any 1 \le i \le d

   \langle T^\pi V_{TD} - V_{TD}, \phi_i \rangle_\mu = 0,

and

   \langle r^\pi + \gamma P^\pi V_{TD} - V_{TD}, \phi_i \rangle_\mu = 0
   \langle r^\pi, \phi_i \rangle_\mu + \sum_{j=1}^d \langle \gamma P^\pi \phi_j - \phi_j, \phi_i \rangle_\mu \, \alpha_{TD,j} = 0,

so \alpha_{TD} is the solution of a linear system of order d.
The LSTD solution \alpha_{TD} is obtained by building the matrix A and vector b defined as

   A_{i,j} = \langle \phi_i, \phi_j - \gamma P^\pi \phi_j \rangle_\mu,
   b_i = \langle \phi_i, r^\pi \rangle_\mu,

and then solving the system A\alpha = b.
Least-squares TD: the approximation error

Problem: in general \Pi_\mu T^\pi does not admit a fixed point (i.e., the matrix A is not invertible).

Solution: use the stationary distribution \mu^\pi of policy \pi, that is \mu^\pi P^\pi = \mu^\pi, i.e.

   \mu^\pi(y) = \sum_x p(y|x, \pi(x)) \mu^\pi(x).
Proposition
The Bellman operator T^\pi is a contraction in the weighted L_{2,\mu^\pi}-norm. Thus the joint operator \Pi_{\mu^\pi} T^\pi is a contraction and it admits a unique fixed point V_{TD}. Then

   \underbrace{\|V^\pi - V_{TD}\|_{\mu^\pi}}_{\text{approximation error}} \le \frac{1}{\sqrt{1-\gamma^2}} \underbrace{\inf_{V \in F} \|V^\pi - V\|_{\mu^\pi}}_{\text{smallest approximation error}}.
Least-squares TD: the implementation

Generate (X_0, X_1, ...) from direct execution of \pi and observe R_t = r(X_t, \pi(X_t)). Compute the estimates

   \hat{A}_{ij} = \frac{1}{n} \sum_{t=1}^n \phi_i(X_t) \big[ \phi_j(X_t) - \gamma \phi_j(X_{t+1}) \big],
   \hat{b}_i = \frac{1}{n} \sum_{t=1}^n \phi_i(X_t) R_t,

and solve \hat{A}\alpha = \hat{b}.

Remark: no need for a generative model. If the chain is ergodic, \hat{A} \to A and \hat{b} \to b as n \to \infty.
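The implementation above can be sketched on a made-up 3-state chain (the chain, rewards, and features are illustrative assumptions, not from the lecture): accumulate \hat{A} and \hat{b} along a single trajectory and solve the d x d system at the end.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, d, n = 0.9, 2, 100_000

# Hypothetical chain under a fixed policy pi (illustrative numbers)
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.5, 0.0, 0.5]])
r_pi = np.array([1.0, 0.0, -1.0])

def phi(x):
    return np.array([1.0, float(x)])        # d = 2 linearly independent features

A_hat = np.zeros((d, d))
b_hat = np.zeros(d)
x = 0
for _ in range(n):                          # single trajectory, no generative model
    x_next = int(rng.choice(3, p=P_pi[x]))
    # A_hat_ij += phi_i(X_t) [phi_j(X_t) - gamma phi_j(X_{t+1})] / n
    A_hat += np.outer(phi(x), phi(x) - gamma * phi(x_next)) / n
    b_hat += phi(x) * r_pi[x] / n
    x = x_next

alpha_td = np.linalg.solve(A_hat, b_hat)    # solve A alpha = b
```

Since the chain samples are drawn from the policy's own dynamics, the empirical averages converge to the \mu^\pi-weighted inner products of the LSTD system.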
Bellman Residual Minimization
Bellman Residual Minimization (BRM): the idea

[Figure: V^\pi, the approximation space F, and V_{BR} minimizing the Bellman residual.]

Let \mu be a distribution over X. V_{BR} is the minimizer of the Bellman residual w.r.t. T^\pi:

   V_{BR} = \arg\min_{V \in F} \|T^\pi V - V\|_{2,\mu}.
The mapping \alpha \mapsto T^\pi V_\alpha - V_\alpha is affine, so the function \alpha \mapsto \|T^\pi V_\alpha - V_\alpha\|_\mu^2 is quadratic. The minimum is obtained by computing the gradient and setting it to zero:

   \Big\langle r^\pi + (\gamma P^\pi - I) \sum_{j=1}^d \phi_j \alpha_j, \ (\gamma P^\pi - I) \phi_i \Big\rangle_\mu = 0,

which can be rewritten as A\alpha = b, with

   A_{i,j} = \langle \phi_i - \gamma P^\pi \phi_i, \ \phi_j - \gamma P^\pi \phi_j \rangle_\mu,
   b_i = \langle \phi_i - \gamma P^\pi \phi_i, \ r^\pi \rangle_\mu.
Remark: the system admits a solution whenever the features \phi_i are linearly independent w.r.t. \mu.

Remark: let \psi_i = \phi_i - \gamma P^\pi \phi_i, i = 1, ..., d; then the previous system can be interpreted as a linear regression problem: minimize \|\alpha^\top \psi - r^\pi\|_\mu.
BRM: the approximation error

Proposition
We have

   \|V^\pi - V_{BR}\| \le \|(I - \gamma P^\pi)^{-1}\| (1 + \gamma \|P^\pi\|) \inf_{V \in F} \|V^\pi - V\|.

If \mu^\pi is the stationary distribution of \pi, then \|P^\pi\|_{\mu^\pi} = 1 and \|(I - \gamma P^\pi)^{-1}\|_{\mu^\pi} = \frac{1}{1-\gamma}, thus

   \|V^\pi - V_{BR}\|_{\mu^\pi} \le \frac{1+\gamma}{1-\gamma} \inf_{V \in F} \|V^\pi - V\|_{\mu^\pi}.
BRM: the implementation

Assumption: a generative model is available.

- Draw n states X_t ~ \mu.
- Call the generative model on (X_t, A_t) (with A_t = \pi(X_t)) and obtain R_t = r(X_t, A_t), Y_t ~ p(.|X_t, A_t).
- Compute

   \hat{B}(V) = \frac{1}{n} \sum_{t=1}^n \Big[ V(X_t) - \underbrace{\big( R_t + \gamma V(Y_t) \big)}_{\hat{T} V(X_t)} \Big]^2.
Problem: this estimator is biased and not consistent! In fact,

   E[\hat{B}(V)] = E\Big[ \big[ V(X_t) - T^\pi V(X_t) + T^\pi V(X_t) - \hat{T} V(X_t) \big]^2 \Big]
                 = \|T^\pi V - V\|_\mu^2 + E\Big[ \big[ T^\pi V(X_t) - \hat{T} V(X_t) \big]^2 \Big],

so minimizing \hat{B}(V) does not correspond to minimizing B(V) (even when n \to \infty).
Solution: in each state X_t, generate two independent samples Y_t and Y'_t ~ p(.|X_t, A_t), and define

   \hat{B}(V) = \frac{1}{n} \sum_{t=1}^n \big[ V(X_t) - \big( R_t + \gamma V(Y_t) \big) \big]\big[ V(X_t) - \big( R_t + \gamma V(Y'_t) \big) \big].

Then \hat{B} \to B as n \to \infty.
The function \alpha \mapsto \hat{B}(V_\alpha) is quadratic and we obtain the linear system

   \hat{A}_{i,j} = \frac{1}{n} \sum_{t=1}^n \big[ \phi_i(X_t) - \gamma \phi_i(Y_t) \big]\big[ \phi_j(X_t) - \gamma \phi_j(Y'_t) \big],
   \hat{b}_i = \frac{1}{n} \sum_{t=1}^n \Big[ \phi_i(X_t) - \gamma \frac{\phi_i(Y_t) + \phi_i(Y'_t)}{2} \Big] R_t.
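The double-sampling estimator can be sketched as follows, again on a made-up 3-state chain (chain, rewards, features, and the uniform sampling distribution \mu are illustrative assumptions). The key point is that the two next states y1, y2 are independent draws from the generative model at the same (X_t, A_t).

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, d, n = 0.9, 2, 50_000

# Hypothetical chain under a fixed policy pi, accessed via a generative model
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.5, 0.0, 0.5]])
r_pi = np.array([1.0, 0.0, -1.0])

def phi(x):
    return np.array([1.0, float(x)])

A_hat = np.zeros((d, d))
b_hat = np.zeros(d)
for _ in range(n):
    x = int(rng.integers(3))                 # X_t ~ mu (uniform here)
    # two INDEPENDENT next-state samples from the generative model
    y1 = int(rng.choice(3, p=P_pi[x]))
    y2 = int(rng.choice(3, p=P_pi[x]))
    A_hat += np.outer(phi(x) - gamma * phi(y1), phi(x) - gamma * phi(y2)) / n
    b_hat += (phi(x) - gamma * (phi(y1) + phi(y2)) / 2) * r_pi[x] / n

alpha_br = np.linalg.solve(A_hat, b_hat)     # unbiased BRM solution
```

Using y1 in one factor and y2 in the other removes the conditional-variance term that biases the single-sample estimator.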
LSTD vs BRM

- Different assumptions: BRM requires a generative model; LSTD requires only a single trajectory.
- The performance is evaluated differently: BRM under any distribution \mu, LSTD under the stationary distribution \mu^\pi.
Reinforcement Learning
Alessandro Lazaric
sequel.lille.inria.fr
Reinforcement Learning with Function Approximation Joseph Christian G. Noel November 2011 Abstract Reinforcement learning (RL) is a key problem in the field of Artificial Intelligence. The main goal is
More informationThe convergence limit of the temporal difference learning
The convergence limit of the temporal difference learning Ryosuke Nomura the University of Tokyo September 3, 2013 1 Outline Reinforcement Learning Convergence limit Construction of the feature vector
More informationLecture 4: Approximate dynamic programming
IEOR 800: Reinforcement learning By Shipra Agrawal Lecture 4: Approximate dynamic programming Deep Q Networks discussed in the last lecture are an instance of approximate dynamic programming. These are
More informationReinforcement Learning
Reinforcement Learning Temporal Difference Learning Temporal difference learning, TD prediction, Q-learning, elibigility traces. (many slides from Marc Toussaint) Vien Ngo MLR, University of Stuttgart
More informationTutorial on Policy Gradient Methods. Jan Peters
Tutorial on Policy Gradient Methods Jan Peters Outline 1. Reinforcement Learning 2. Finite Difference vs Likelihood-Ratio Policy Gradients 3. Likelihood-Ratio Policy Gradients 4. Conclusion General Setup
More informationChristopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015
Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)
More informationReinforcement Learning. Spring 2018 Defining MDPs, Planning
Reinforcement Learning Spring 2018 Defining MDPs, Planning understandability 0 Slide 10 time You are here Markov Process Where you will go depends only on where you are Markov Process: Information state
More informationCS599 Lecture 1 Introduction To RL
CS599 Lecture 1 Introduction To RL Reinforcement Learning Introduction Learning from rewards Policies Value Functions Rewards Models of the Environment Exploitation vs. Exploration Dynamic Programming
More informationReinforcement Learning
Reinforcement Learning RL in continuous MDPs March April, 2015 Large/Continuous MDPs Large/Continuous state space Tabular representation cannot be used Large/Continuous action space Maximization over action
More informationTemporal Difference. Learning KENNETH TRAN. Principal Research Engineer, MSR AI
Temporal Difference Learning KENNETH TRAN Principal Research Engineer, MSR AI Temporal Difference Learning Policy Evaluation Intro to model-free learning Monte Carlo Learning Temporal Difference Learning
More informationMDP Preliminaries. Nan Jiang. February 10, 2019
MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process
More informationGeometric Variance Reduction in MarkovChains. Application tovalue Function and Gradient Estimation
Geometric Variance Reduction in arkovchains. Application tovalue Function and Gradient Estimation Rémi unos Centre de athématiques Appliquées, Ecole Polytechnique, 98 Palaiseau Cedex, France. remi.munos@polytechnique.fr
More informationReinforcement Learning and Optimal Control. ASU, CSE 691, Winter 2019
Reinforcement Learning and Optimal Control ASU, CSE 691, Winter 2019 Dimitri P. Bertsekas dimitrib@mit.edu Lecture 8 Bertsekas Reinforcement Learning 1 / 21 Outline 1 Review of Infinite Horizon Problems
More informationLecture 1: March 7, 2018
Reinforcement Learning Spring Semester, 2017/8 Lecture 1: March 7, 2018 Lecturer: Yishay Mansour Scribe: ym DISCLAIMER: Based on Learning and Planning in Dynamical Systems by Shie Mannor c, all rights
More informationReinforcement Learning
Reinforcement Learning Temporal Difference Learning Temporal difference learning, TD prediction, Q-learning, elibigility traces. (many slides from Marc Toussaint) Vien Ngo Marc Toussaint University of
More informationLearning the Variance of the Reward-To-Go
Journal of Machine Learning Research 17 (2016) 1-36 Submitted 8/14; Revised 12/15; Published 3/16 Learning the Variance of the Reward-To-Go Aviv Tamar Department of Electrical Engineering and Computer
More informationA Dantzig Selector Approach to Temporal Difference Learning
Matthieu Geist Supélec, IMS Research Group, Metz, France Bruno Scherrer INRIA, MAIA Project Team, Nancy, France Alessandro Lazaric and Mohammad Ghavamzadeh INRIA Lille - Team SequeL, France matthieu.geist@supelec.fr
More informationReinforcement Learning. George Konidaris
Reinforcement Learning George Konidaris gdk@cs.brown.edu Fall 2017 Machine Learning Subfield of AI concerned with learning from data. Broadly, using: Experience To Improve Performance On Some Task (Tom
More informationState Space Abstractions for Reinforcement Learning
State Space Abstractions for Reinforcement Learning Rowan McAllister and Thang Bui MLG RCC 6 November 24 / 24 Outline Introduction Markov Decision Process Reinforcement Learning State Abstraction 2 Abstraction
More informationRegularized Least Squares Temporal Difference learning with nested l 2 and l 1 penalization
Regularized Least Squares Temporal Difference learning with nested l 2 and l 1 penalization Matthew W. Hoffman 1, Alessandro Lazaric 2, Mohammad Ghavamzadeh 2, and Rémi Munos 2 1 University of British
More informationINTRODUCTION TO MARKOV DECISION PROCESSES
INTRODUCTION TO MARKOV DECISION PROCESSES Balázs Csanád Csáji Research Fellow, The University of Melbourne Signals & Systems Colloquium, 29 April 2010 Department of Electrical and Electronic Engineering,
More informationOptimal Stopping Problems
2.997 Decision Making in Large Scale Systems March 3 MIT, Spring 2004 Handout #9 Lecture Note 5 Optimal Stopping Problems In the last lecture, we have analyzed the behavior of T D(λ) for approximating
More informationInternet Monetization
Internet Monetization March May, 2013 Discrete time Finite A decision process (MDP) is reward process with decisions. It models an environment in which all states are and time is divided into stages. Definition
More informationReinforcement Learning
Reinforcement Learning March May, 2013 Schedule Update Introduction 03/13/2015 (10:15-12:15) Sala conferenze MDPs 03/18/2015 (10:15-12:15) Sala conferenze Solving MDPs 03/20/2015 (10:15-12:15) Aula Alpha
More informationCSE 190: Reinforcement Learning: An Introduction. Chapter 8: Generalization and Function Approximation. Pop Quiz: What Function Are We Approximating?
CSE 190: Reinforcement Learning: An Introduction Chapter 8: Generalization and Function Approximation Objectives of this chapter: Look at how experience with a limited part of the state set be used to
More informationConvex Optimization. Problem set 2. Due Monday April 26th
Convex Optimization Problem set 2 Due Monday April 26th 1 Gradient Decent without Line-search In this problem we will consider gradient descent with predetermined step sizes. That is, instead of determining
More informationPART A and ONE question from PART B; or ONE question from PART A and TWO questions from PART B.
Advanced Topics in Machine Learning, GI13, 2010/11 Advanced Topics in Machine Learning, GI13, 2010/11 Answer any THREE questions. Each question is worth 20 marks. Use separate answer books Answer any THREE
More information15-889e Policy Search: Gradient Methods Emma Brunskill. All slides from David Silver (with EB adding minor modificafons), unless otherwise noted
15-889e Policy Search: Gradient Methods Emma Brunskill All slides from David Silver (with EB adding minor modificafons), unless otherwise noted Outline 1 Introduction 2 Finite Difference Policy Gradient
More informationThe Art of Sequential Optimization via Simulations
The Art of Sequential Optimization via Simulations Stochastic Systems and Learning Laboratory EE, CS* & ISE* Departments Viterbi School of Engineering University of Southern California (Based on joint
More informationChapter 13 Wow! Least Squares Methods in Batch RL
Chapter 13 Wow! Least Squares Methods in Batch RL Objectives of this chapter: Introduce batch RL Tradeoffs: Least squares vs. gradient methods Evaluating policies: Fitted value iteration Bellman residual
More informationOptimal Control. McGill COMP 765 Oct 3 rd, 2017
Optimal Control McGill COMP 765 Oct 3 rd, 2017 Classical Control Quiz Question 1: Can a PID controller be used to balance an inverted pendulum: A) That starts upright? B) That must be swung-up (perhaps
More informationConvergence of Synchronous Reinforcement Learning. with linear function approximation
Convergence of Synchronous Reinforcement Learning with Linear Function Approximation Artur Merke artur.merke@udo.edu Lehrstuhl Informatik, University of Dortmund, 44227 Dortmund, Germany Ralf Schoknecht
More informationReinforcement Learning: An Introduction
Introduction Betreuer: Freek Stulp Hauptseminar Intelligente Autonome Systeme (WiSe 04/05) Forschungs- und Lehreinheit Informatik IX Technische Universität München November 24, 2004 Introduction What is
More informationA Gentle Introduction to Reinforcement Learning
A Gentle Introduction to Reinforcement Learning Alexander Jung 2018 1 Introduction and Motivation Consider the cleaning robot Rumba which has to clean the office room B329. In order to keep things simple,
More informationJun Ma and Warren B. Powell Department of Operations Research and Financial Engineering Princeton University, Princeton, NJ 08544
Convergence Analysis of Kernel-based On-policy Approximate Policy Iteration Algorithms for Markov Decision Processes with Continuous, Multidimensional States and Actions Jun Ma and Warren B. Powell Department
More informationTemporal difference learning
Temporal difference learning AI & Agents for IET Lecturer: S Luz http://www.scss.tcd.ie/~luzs/t/cs7032/ February 4, 2014 Recall background & assumptions Environment is a finite MDP (i.e. A and S are finite).
More informationReal Time Value Iteration and the State-Action Value Function
MS&E338 Reinforcement Learning Lecture 3-4/9/18 Real Time Value Iteration and the State-Action Value Function Lecturer: Ben Van Roy Scribe: Apoorva Sharma and Tong Mu 1 Review Last time we left off discussing
More informationLecture 3: Markov Decision Processes
Lecture 3: Markov Decision Processes Joseph Modayil 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs Markov Processes Introduction Introduction to MDPs Markov
More informationReinforcement Learning
Reinforcement Learning Ron Parr CompSci 7 Department of Computer Science Duke University With thanks to Kris Hauser for some content RL Highlights Everybody likes to learn from experience Use ML techniques
More informationA Function Approximation Approach to Estimation of Policy Gradient for POMDP with Structured Policies
A Function Approximation Approach to Estimation of Policy Gradient for POMDP with Structured Policies Huizhen Yu Lab for Information and Decision Systems Massachusetts Institute of Technology Cambridge,
More informationLecture 17: Reinforcement Learning, Finite Markov Decision Processes
CSE599i: Online and Adaptive Machine Learning Winter 2018 Lecture 17: Reinforcement Learning, Finite Markov Decision Processes Lecturer: Kevin Jamieson Scribes: Aida Amini, Kousuke Ariga, James Ferguson,
More informationECE276B: Planning & Learning in Robotics Lecture 16: Model-free Control
ECE276B: Planning & Learning in Robotics Lecture 16: Model-free Control Lecturer: Nikolay Atanasov: natanasov@ucsd.edu Teaching Assistants: Tianyu Wang: tiw161@eng.ucsd.edu Yongxi Lu: yol070@eng.ucsd.edu
More information, and rewards and transition matrices as shown below:
CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount
More informationMARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti
1 MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti Historical background 2 Original motivation: animal learning Early
More informationVariance Reduction for Policy Gradient Methods. March 13, 2017
Variance Reduction for Policy Gradient Methods March 13, 2017 Reward Shaping Reward Shaping Reward Shaping Reward shaping: r(s, a, s ) = r(s, a, s ) + γφ(s ) Φ(s) for arbitrary potential Φ Theorem: r admits
More informationReinforcement Learning. Yishay Mansour Tel-Aviv University
Reinforcement Learning Yishay Mansour Tel-Aviv University 1 Reinforcement Learning: Course Information Classes: Wednesday Lecture 10-13 Yishay Mansour Recitations:14-15/15-16 Eliya Nachmani Adam Polyak
More informationLecture 8: Policy Gradient
Lecture 8: Policy Gradient Hado van Hasselt Outline 1 Introduction 2 Finite Difference Policy Gradient 3 Monte-Carlo Policy Gradient 4 Actor-Critic Policy Gradient Introduction Vapnik s rule Never solve
More informationLecture 9: Policy Gradient II 1
Lecture 9: Policy Gradient II 1 Emma Brunskill CS234 Reinforcement Learning. Winter 2019 Additional reading: Sutton and Barto 2018 Chp. 13 1 With many slides from or derived from David Silver and John
More informationPART A and ONE question from PART B; or ONE question from PART A and TWO questions from PART B.
Advanced Topics in Machine Learning, GI13, 2010/11 Advanced Topics in Machine Learning, GI13, 2010/11 Answer any THREE questions. Each question is worth 20 marks. Use separate answer books Answer any THREE
More informationLeast-Squares Methods for Policy Iteration
Least-Squares Methods for Policy Iteration Lucian Buşoniu, Alessandro Lazaric, Mohammad Ghavamzadeh, Rémi Munos, Robert Babuška, and Bart De Schutter Abstract Approximate reinforcement learning deals with
More informationLecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.
Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods
More informationReinforcement Learning
1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision
More informationMarkov Decision Processes Infinite Horizon Problems
Markov Decision Processes Infinite Horizon Problems Alan Fern * * Based in part on slides by Craig Boutilier and Daniel Weld 1 What is a solution to an MDP? MDP Planning Problem: Input: an MDP (S,A,R,T)
More informationA Tour of Reinforcement Learning The View from Continuous Control. Benjamin Recht University of California, Berkeley
A Tour of Reinforcement Learning The View from Continuous Control Benjamin Recht University of California, Berkeley trustable, scalable, predictable Control Theory! Reinforcement Learning is the study
More informationDual Temporal Difference Learning
Dual Temporal Difference Learning Min Yang Dept. of Computing Science University of Alberta Edmonton, Alberta Canada T6G 2E8 Yuxi Li Dept. of Computing Science University of Alberta Edmonton, Alberta Canada
More informationMarkov Decision Processes and Dynamic Programming
Master MVA: Reinforcement Learning Lecture: 2 Markov Decision Processes and Dnamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of
More information6 Basic Convergence Results for RL Algorithms
Learning in Complex Systems Spring 2011 Lecture Notes Nahum Shimkin 6 Basic Convergence Results for RL Algorithms We establish here some asymptotic convergence results for the basic RL algorithms, by showing
More informationIntroduction to Reinforcement Learning Part 1: Markov Decision Processes
Introduction to Reinforcement Learning Part 1: Markov Decision Processes Rowan McAllister Reinforcement Learning Reading Group 8 April 2015 Note I ve created these slides whilst following Algorithms for
More informationGrundlagen der Künstlichen Intelligenz
Grundlagen der Künstlichen Intelligenz Reinforcement learning Daniel Hennes 4.12.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Reinforcement learning Model based and
More informationReinforcement Learning
Reinforcement Learning Function approximation Daniel Hennes 19.06.2017 University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Eligibility traces n-step TD returns Forward and backward view Function
More informationPreconditioned Temporal Difference Learning
Hengshuai Yao Zhi-Qiang Liu School of Creative Media, City University of Hong Kong, Hong Kong, China hengshuai@gmail.com zq.liu@cityu.edu.hk Abstract This paper extends many of the recent popular policy
More informationOn Average Versus Discounted Reward Temporal-Difference Learning
Machine Learning, 49, 179 191, 2002 c 2002 Kluwer Academic Publishers. Manufactured in The Netherlands. On Average Versus Discounted Reward Temporal-Difference Learning JOHN N. TSITSIKLIS Laboratory for
More informationCentral-limit approach to risk-aware Markov decision processes
Central-limit approach to risk-aware Markov decision processes Jia Yuan Yu Concordia University November 27, 2015 Joint work with Pengqian Yu and Huan Xu. Inventory Management 1 1 Look at current inventory
More informationCSE 190: Reinforcement Learning: An Introduction. Chapter 8: Generalization and Function Approximation. Pop Quiz: What Function Are We Approximating?
CSE 190: Reinforcement Learning: An Introduction Chapter 8: Generalization and Function Approximation Objectives of this chapter: Look at how experience with a limited part of the state set be used to
More informationGeneralization and Function Approximation
Generalization and Function Approximation 0 Generalization and Function Approximation Suggested reading: Chapter 8 in R. S. Sutton, A. G. Barto: Reinforcement Learning: An Introduction MIT Press, 1998.
More informationChapter 8: Generalization and Function Approximation
Chapter 8: Generalization and Function Approximation Objectives of this chapter: Look at how experience with a limited part of the state set be used to produce good behavior over a much larger part. Overview
More informationLecture 23: Reinforcement Learning
Lecture 23: Reinforcement Learning MDPs revisited Model-based learning Monte Carlo value function estimation Temporal-difference (TD) learning Exploration November 23, 2006 1 COMP-424 Lecture 23 Recall:
More informationApproximation Methods in Reinforcement Learning
2018 CS420, Machine Learning, Lecture 12 Approximation Methods in Reinforcement Learning Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/cs420/index.html Reinforcement
More informationNotes on Reinforcement Learning
1 Introduction Notes on Reinforcement Learning Paulo Eduardo Rauber 2014 Reinforcement learning is the study of agents that act in an environment with the goal of maximizing cumulative reward signals.
More informationPolicy Gradient Methods. February 13, 2017
Policy Gradient Methods February 13, 2017 Policy Optimization Problems maximize E π [expression] π Fixed-horizon episodic: T 1 Average-cost: lim T 1 T r t T 1 r t Infinite-horizon discounted: γt r t Variable-length
More informationLinear Least-squares Dyna-style Planning
Linear Least-squares Dyna-style Planning Hengshuai Yao Department of Computing Science University of Alberta Edmonton, AB, Canada T6G2E8 hengshua@cs.ualberta.ca Abstract World model is very important for
More informationReinforcement Learning as Classification Leveraging Modern Classifiers
Reinforcement Learning as Classification Leveraging Modern Classifiers Michail G. Lagoudakis and Ronald Parr Department of Computer Science Duke University Durham, NC 27708 Machine Learning Reductions
More informationExponential Moving Average Based Multiagent Reinforcement Learning Algorithms
Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Mostafa D. Awheda Department of Systems and Computer Engineering Carleton University Ottawa, Canada KS 5B6 Email: mawheda@sce.carleton.ca
More informationOpen Theoretical Questions in Reinforcement Learning
Open Theoretical Questions in Reinforcement Learning Richard S. Sutton AT&T Labs, Florham Park, NJ 07932, USA, sutton@research.att.com, www.cs.umass.edu/~rich Reinforcement learning (RL) concerns the problem
More informationReinforcement Learning In Continuous Time and Space
Reinforcement Learning In Continuous Time and Space presentation of paper by Kenji Doya Leszek Rybicki lrybicki@mat.umk.pl 18.07.2008 Leszek Rybicki lrybicki@mat.umk.pl Reinforcement Learning In Continuous
More informationJournal of Computational and Applied Mathematics. Projected equation methods for approximate solution of large linear systems
Journal of Computational and Applied Mathematics 227 2009) 27 50 Contents lists available at ScienceDirect Journal of Computational and Applied Mathematics journal homepage: www.elsevier.com/locate/cam
More informationAn Adaptive Clustering Method for Model-free Reinforcement Learning
An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at
More informationReinforcement Learning Part 2
Reinforcement Learning Part 2 Dipendra Misra Cornell University dkm@cs.cornell.edu https://dipendramisra.wordpress.com/ From previous tutorial Reinforcement Learning Exploration No supervision Agent-Reward-Environment
More informationVariational Inference via Stochastic Backpropagation
Variational Inference via Stochastic Backpropagation Kai Fan February 27, 2016 Preliminaries Stochastic Backpropagation Variational Auto-Encoding Related Work Summary Outline Preliminaries Stochastic Backpropagation
More informationMarkov Decision Processes and Solving Finite Problems. February 8, 2017
Markov Decision Processes and Solving Finite Problems February 8, 2017 Overview of Upcoming Lectures Feb 8: Markov decision processes, value iteration, policy iteration Feb 13: Policy gradients Feb 15:
More informationQ-Learning for Markov Decision Processes*
McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan khoa.phan@mail.mcgill.ca Sandeep Manjanna sandeep.manjanna@mail.mcgill.ca (*Based on: Convergence of
More information