Approximate Dynamic Programming


Approximate Dynamic Programming
A. LAZARIC (SequeL Team @ INRIA-Lille)
Ecole Centrale - Option DAD, EC-RL Course

Value Iteration: the Idea

1. Let $V_0$ be any vector in $\mathbb{R}^N$.
2. At each iteration $k = 1, 2, \ldots, K$: compute $V_{k+1} = \mathcal{T} V_k$.
3. Return the greedy policy
$\pi_K(x) \in \arg\max_{a \in A} \big[ r(x,a) + \gamma \sum_y p(y|x,a) V_K(y) \big].$
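To make the loop concrete, here is a minimal tabular value-iteration sketch in Python/NumPy (not part of the slides; it assumes the model $p$ and $r$ are given as arrays):

import numpy as np

def value_iteration(P, r, gamma, K):
    # P[x, a, y] = p(y | x, a), shape (S, A, S); r[x, a] = r(x, a), shape (S, A)
    S, A = r.shape
    V = np.zeros(S)                              # V_0: any vector in R^N
    for _ in range(K):
        V = np.max(r + gamma * P @ V, axis=1)    # V_{k+1} = T V_k (Bellman optimality backup)
    pi = np.argmax(r + gamma * P @ V, axis=1)    # greedy policy w.r.t. V_K
    return V, pi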

Value Iteration: the Guarantees

From the fixed-point property of $\mathcal{T}$: $\lim_{k \to \infty} V_k = V^*$.
From the contraction property of $\mathcal{T}$:
$\|V_{k+1} - V^*\|_\infty \le \gamma^{k+1} \|V_0 - V^*\|_\infty.$
Problem: what if $V_{k+1} \neq \mathcal{T} V_k$?

Policy Iteration: the Idea

1. Let $\pi_0$ be any stationary policy.
2. At each iteration $k = 1, 2, \ldots, K$:
Policy evaluation: given $\pi_k$, compute $V_k = V^{\pi_k}$.
Policy improvement: compute the greedy policy
$\pi_{k+1}(x) \in \arg\max_{a \in A} \big[ r(x,a) + \gamma \sum_y p(y|x,a) V^{\pi_k}(y) \big].$
3. Return the last policy $\pi_K$.
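A corresponding tabular policy-iteration sketch (again assuming the model is known; the exact evaluation step solves the linear system $(I - \gamma P^\pi) V = r^\pi$):

import numpy as np

def policy_iteration(P, r, gamma, K):
    # P[x, a, y] = p(y | x, a), shape (S, A, S); r[x, a] = r(x, a), shape (S, A)
    S, A = r.shape
    pi = np.zeros(S, dtype=int)                  # pi_0: any stationary policy
    for _ in range(K):
        P_pi = P[np.arange(S), pi]               # transition matrix under pi, shape (S, S)
        r_pi = r[np.arange(S), pi]               # reward vector under pi, shape (S,)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # policy evaluation: V^pi
        pi = np.argmax(r + gamma * P @ V, axis=1)             # policy improvement: greedy w.r.t. V^pi
    return pi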

Policy Iteration: the Guarantees

The policy iteration algorithm generates a sequence of policies with non-decreasing performance, $V^{\pi_{k+1}} \ge V^{\pi_k}$, and it converges to $\pi^*$ in a finite number of iterations.
Problem: what if $V_k \neq V^{\pi_k}$?

Sources of Error

Approximation error: if $X$ is large or continuous, value functions $V$ cannot be represented exactly; use an approximation space $\mathcal{F}$.
Estimation error: if the reward $r$ and the dynamics $p$ are unknown, the Bellman operators $\mathcal{T}$ and $\mathcal{T}^\pi$ cannot be computed exactly; estimate the Bellman operators from samples.

Outline

Performance Loss
Approximate Value Iteration
Approximate Policy Iteration

From Approximation Error to Performance Loss

Question: if $V$ is an approximation of the optimal value function $V^*$ with an error
$\text{error} = \|V - V^*\|_\infty,$
how does it translate to the (loss of) performance of the greedy policy
$\pi(x) \in \arg\max_{a \in A} \sum_y p(y|x,a) \big[ r(x,a,y) + \gamma V(y) \big],$
i.e., what is the performance loss $\|V^* - V^\pi\|_\infty$?

Proposition. Let $V \in \mathbb{R}^N$ be an approximation of $V^*$ and $\pi$ its corresponding greedy policy. Then
$\underbrace{\|V^* - V^\pi\|_\infty}_{\text{performance loss}} \le \frac{2\gamma}{1-\gamma} \underbrace{\|V^* - V\|_\infty}_{\text{approx. error}}.$
Furthermore, there exists $\epsilon > 0$ such that if $\|V - V^*\|_\infty \le \epsilon$, then $\pi$ is optimal.
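As a rough worked instance of the bound (not from the slides): with $\gamma = 0.9$, an approximation error of $0.1$ may translate into a performance loss as large as
$\|V^* - V^\pi\|_\infty \le \frac{2 \cdot 0.9}{1 - 0.9} \cdot 0.1 = 1.8,$
i.e., the error is amplified by a factor $2\gamma/(1-\gamma) = 18$.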

Question: how do we compute $V$?
Problem: unlike in standard approximation scenarios (see supervised learning), we have only limited access to the target function $V^*$.
Objective: given an approximation space $\mathcal{F}$, compute an approximation $V$ which is as close as possible to the best approximation of $V^*$ in $\mathcal{F}$, i.e.
$V \approx \arg\inf_{f \in \mathcal{F}} \|V^* - f\|.$

Approximate Value Iteration

Approximate Value Iteration: the Idea

Let $\mathcal{A}$ be an approximation operator.
1. Let $V_0$ be any vector in $\mathbb{R}^N$.
2. At each iteration $k = 1, 2, \ldots, K$: compute $V_{k+1} = \mathcal{A} \mathcal{T} V_k$.
3. Return the greedy policy
$\pi_K(x) \in \arg\max_{a \in A} \big[ r(x,a) + \gamma \sum_y p(y|x,a) V_K(y) \big].$

Let $\mathcal{A} = \Pi_\infty$ be a projection operator in $L_\infty$-norm, which corresponds to
$V_{k+1} = \Pi_\infty \mathcal{T} V_k = \arg\inf_{V \in \mathcal{F}} \|\mathcal{T} V_k - V\|_\infty.$

Approximate Value Iteration: convergence

Proposition. The projection $\Pi_\infty$ is a non-expansion and the joint operator $\Pi_\infty \mathcal{T}$ is a contraction. Thus there exists a unique fixed point $\tilde V = \Pi_\infty \mathcal{T} \tilde V$, which guarantees the convergence of AVI.

Approximate Value Iteration: performance loss

Proposition (Bertsekas & Tsitsiklis, 1996). Let $V_K$ be the function returned by AVI after $K$ iterations and $\pi_K$ its corresponding greedy policy. Then
$\|V^* - V^{\pi_K}\|_\infty \le \frac{2\gamma}{(1-\gamma)^2} \underbrace{\max_{0 \le k < K} \|\mathcal{T} V_k - \mathcal{A}\mathcal{T} V_k\|_\infty}_{\text{worst approx. error}} + \frac{2\gamma^{K+1}}{1-\gamma} \underbrace{\|V^* - V_0\|_\infty}_{\text{initial error}}.$

Fitted Q-iteration with linear approximation

Assumption: access to a generative model: given a state $x$ and an action $a$, it returns the reward $r(x,a)$ and a next state $y \sim p(\cdot|x,a)$.
Idea: work with Q-functions and linear spaces.
$Q^*$ is the unique fixed point of $\mathcal{T}$ defined over $X \times A$ as
$\mathcal{T} Q(x,a) = \sum_y p(y|x,a) \big[ r(x,a,y) + \gamma \max_b Q(y,b) \big].$
$\mathcal{F}$ is a space defined by $d$ features $\phi_1, \ldots, \phi_d : X \times A \to \mathbb{R}$ as
$\mathcal{F} = \Big\{ Q_\alpha(x,a) = \sum_{j=1}^d \alpha_j \phi_j(x,a), \ \alpha \in \mathbb{R}^d \Big\}.$
At each iteration compute $Q_{k+1} = \Pi \mathcal{T} Q_k$.

At each iteration compute $Q_{k+1} = \Pi \mathcal{T} Q_k$.
Problems: the $\Pi$ operator cannot be computed efficiently; the Bellman operator $\mathcal{T}$ is often unknown.

Problem: the $\Pi$ operator cannot be computed efficiently.
Solution: let $\mu$ be a distribution over $X$ and use a projection in $L_{2,\mu}$-norm onto the space $\mathcal{F}$:
$Q_{k+1} = \arg\min_{Q \in \mathcal{F}} \|Q - \mathcal{T} Q_k\|_\mu^2.$

Problem: the Bellman operator $\mathcal{T}$ is often unknown.
1. Sample $n$ state-action pairs $(X_i, A_i)$ with $X_i \sim \mu$ and $A_i$ random.
2. Simulate $Y_i \sim p(\cdot|X_i, A_i)$ and $R_i = r(X_i, A_i, Y_i)$ with the generative model.
3. Estimate $\mathcal{T} Q_k(X_i, A_i)$ with $Z_i = R_i + \gamma \max_{a \in A} Q_k(Y_i, a)$ (unbiased: $\mathbb{E}[Z_i \mid X_i, A_i] = \mathcal{T} Q_k(X_i, A_i)$).

At each iteration $k$ compute $Q_{k+1}$ as
$Q_{k+1} = \arg\min_{Q_\alpha \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \big[ Q_\alpha(X_i, A_i) - Z_i \big]^2.$
Since $Q_\alpha$ is a linear function of $\alpha$, this is a simple quadratic minimization problem with a closed-form solution.
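A sketch of one possible implementation with linear features (the helpers sample_states, sample_model, and phi are assumptions introduced for illustration, not part of the slides):

import numpy as np

def fitted_q_iteration(sample_states, actions, sample_model, phi, d, n, gamma, K):
    # sample_states(n) -> n states drawn from mu (assumed helper)
    # sample_model(x, a) -> (R, Y): one call to the generative model (assumed helper)
    # phi(x, a) -> feature vector in R^d (assumed helper)
    alpha = np.zeros(d)                                    # Q_0 = 0
    rng = np.random.default_rng(0)
    for _ in range(K):
        X = sample_states(n)
        A = rng.choice(actions, size=n)
        Phi = np.array([phi(x, a) for x, a in zip(X, A)])  # design matrix, shape (n, d)
        Z = np.empty(n)
        for i, (x, a) in enumerate(zip(X, A)):
            R, Y = sample_model(x, a)
            Z[i] = R + gamma * max(phi(Y, b) @ alpha for b in actions)   # Z_i estimates T Q_k(X_i, A_i)
        # least-squares fit: alpha_{k+1} = argmin (1/n) sum_i [Q_alpha(X_i, A_i) - Z_i]^2
        alpha, *_ = np.linalg.lstsq(Phi, Z, rcond=None)
    return alpha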

Other implementations: K-nearest neighbour; regularized linear regression with $L_1$ or $L_2$ regularization; neural networks; support vector machines.

Example: the Optimal Replacement Problem

State: level of wear of an object (e.g., a car).
Action: {(R)eplace, (K)eep}.
Cost: $c(x, R) = C$; $c(x, K) = c(x)$, the maintenance cost plus extra costs.
Dynamics: $p(\cdot|x, R) \sim \mathrm{Exp}(\beta)$ with density $d(y) = \beta e^{-\beta y} \mathbb{1}\{y \ge 0\}$; $p(\cdot|x, K) \sim x + \mathrm{Exp}(\beta)$ with density $d(y - x)$.
Problem: minimize the discounted expected cost over an infinite horizon.

Optimal value function:
$V^*(x) = \min\Big\{ c(x) + \gamma \int_0^\infty d(y - x) V^*(y)\, dy, \; C + \gamma \int_0^\infty d(y) V^*(y)\, dy \Big\}.$
Optimal policy: the action that attains the minimum.

[Figure: management cost and optimal value function as a function of the wear, with the Keep (K) / Replace (R) regions.]

Linear approximation space: $\mathcal{F} := \Big\{ V_\alpha(x) = \sum_{k=1}^{20} \alpha_k \cos\big(k\pi \tfrac{x}{x_{\max}}\big) \Big\}.$
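For illustration, a sketch of the cosine feature map and of a generative model for this problem (the maintenance cost function c is an assumed input; it is not specified on the slides):

import numpy as np

def cosine_features(x, x_max, d=20):
    # phi_k(x) = cos(k * pi * x / x_max), k = 1, ..., d
    k = np.arange(1, d + 1)
    return np.cos(k * np.pi * x / x_max)

def replacement_model(x, action, c, C, beta, rng):
    # One step of the generative model: returns (cost, next wear level).
    wear = rng.exponential(1.0 / beta)        # Exp(beta) increment (NumPy scale = 1/beta)
    if action == "R":
        return C, wear                        # replace: pay C, wear restarts from 0
    return c(x), x + wear                     # keep: pay maintenance c(x), wear accumulates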

Collect $N$ samples $(x_n)_{1 \le n \le N}$ on a uniform grid.

[Figure: left, the target values computed as $\{\mathcal{T} V_0(x_n)\}_{1 \le n \le N}$; right, the approximation $V_1 \in \mathcal{F}$ of the target function $\mathcal{T} V_0$.]

[Figure: left, the target values computed as $\{\mathcal{T} V_1(x_n)\}_{1 \le n \le N}$; center, the approximation $V_2 \in \mathcal{F}$ of $\mathcal{T} V_1$; right, the approximation $V_n \in \mathcal{F}$ after $n$ iterations.]

Approximate Policy Iteration

Outline: Linear Temporal-Difference; Least-Squares Temporal Difference; Bellman Residual Minimization.

Approximate Policy Iteration: the Idea

Let $\mathcal{A}$ be an approximation operator.
Policy evaluation: given the current policy $\pi_k$, compute $V_k = \mathcal{A} V^{\pi_k}$.
Policy improvement: given the approximated value of the current policy, compute the greedy policy w.r.t. $V_k$ as
$\pi_{k+1}(x) \in \arg\max_{a \in A} \big[ r(x,a) + \gamma \sum_{y \in X} p(y|x,a) V_k(y) \big].$
Problem: the algorithm is no longer guaranteed to converge.

[Figure: $\|V^* - V^{\pi_k}\|$ as a function of $k$, with an asymptotic error level.]
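Schematically, the API loop alternates the two steps (a bare-bones sketch with assumed helper functions for the evaluation and improvement operators):

def approximate_policy_iteration(pi0, evaluate, improve, K):
    # evaluate(pi) -> V approximating V^pi (e.g., TD(lambda), LSTD, or BRM below)
    # improve(V)   -> greedy policy w.r.t. V (requires the model or Q-function estimates)
    pi = pi0
    for _ in range(K):
        V = evaluate(pi)      # approximate policy evaluation: V_k = A V^{pi_k}
        pi = improve(V)       # policy improvement: greedy w.r.t. V_k
    return pi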

Approximate Policy Iteration: performance loss

Proposition. The asymptotic performance of the policies $\pi_k$ generated by the API algorithm is related to the approximation error as
$\limsup_{k \to \infty} \underbrace{\|V^* - V^{\pi_k}\|_\infty}_{\text{performance loss}} \le \frac{2\gamma}{(1-\gamma)^2} \limsup_{k \to \infty} \underbrace{\|V_k - V^{\pi_k}\|_\infty}_{\text{approximation error}}.$

Linear Temporal-Difference

Linear TD(λ): the algorithm

Given a linear space $\mathcal{F} = \{ V_\alpha(x) = \sum_{i=1}^d \alpha_i \phi_i(x), \ \alpha \in \mathbb{R}^d \}$, a trace vector $z \in \mathbb{R}^d$ and a parameter vector $\alpha \in \mathbb{R}^d$ initialized to zero. Generate a sequence of states $(x_0, x_1, x_2, \ldots)$ according to $\pi$. At each step $t$, the temporal difference is
$d_t = r(x_t, \pi(x_t)) + \gamma V_{\alpha_t}(x_{t+1}) - V_{\alpha_t}(x_t)$
and the parameters are updated as
$\alpha_{t+1} = \alpha_t + \eta_t\, d_t\, z_t, \qquad z_{t+1} = \lambda\gamma\, z_t + \phi(x_{t+1}),$
where $\eta_t$ is the learning step.
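A sketch of the update loop (assuming a pre-generated trajectory and reward sequence; the trace is refreshed with $\phi(x_t)$ before the parameter update, a standard way of writing the same recursion):

import numpy as np

def linear_td_lambda(states, rewards, phi, d, gamma, lam, steps):
    # states:  x_0, ..., x_T generated by following pi
    # rewards: rewards[t] = r(x_t, pi(x_t)); phi(x) -> feature vector in R^d
    # steps:   learning rates eta_t
    alpha = np.zeros(d)
    z = np.zeros(d)                                        # eligibility trace
    for t in range(len(states) - 1):
        x, x_next = states[t], states[t + 1]
        z = lam * gamma * z + phi(x)                       # accumulate trace
        d_t = rewards[t] + gamma * phi(x_next) @ alpha - phi(x) @ alpha   # temporal difference
        alpha = alpha + steps[t] * d_t * z                 # parameter update
    return alpha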

Linear TD(λ): approximation error

Proposition (Tsitsiklis & Van Roy, 1996). Let the learning rate $\eta_t$ satisfy $\sum_{t \ge 0} \eta_t = \infty$ and $\sum_{t \ge 0} \eta_t^2 < \infty$. Assume that $\pi$ admits a stationary distribution $\mu^\pi$ and that the features $(\phi_i)_{1 \le i \le d}$ are linearly independent. Then there exists a fixed $\alpha^*$ such that $\lim_{t \to \infty} \alpha_t = \alpha^*$. Furthermore,
$\underbrace{\|V_{\alpha^*} - V^\pi\|_{2,\mu^\pi}}_{\text{approximation error}} \le \frac{1 - \lambda\gamma}{1 - \gamma} \underbrace{\inf_\alpha \|V_\alpha - V^\pi\|_{2,\mu^\pi}}_{\text{smallest approximation error}}.$

Remark: for $\lambda = 1$ we recover Monte-Carlo (TD(1)) and the bound is the smallest (the factor equals 1).
Problem: the bound does not account for the variance (i.e., the number of samples needed for $\alpha_t$ to converge to $\alpha^*$).

Linear TD(λ): implementation

Pros: simple to implement; computational cost linear in $d$.
Cons: very sample inefficient; many samples are needed to converge.

Least-Squares Temporal Difference

Least-squares TD: the algorithm

Recall: $V^\pi = \mathcal{T}^\pi V^\pi$. Intuition: compute $V = \mathcal{A} \mathcal{T}^\pi V$.

[Figure: $V^\pi$, the approximation space $\mathcal{F}$, and the LSTD fixed point $V_{TD} = \Pi_\mu \mathcal{T}^\pi V_{TD}$.]

Focus on the $L_{2,\mu}$-weighted norm and the projection $\Pi_\mu$: $\Pi_\mu g = \arg\min_{f \in \mathcal{F}} \|f - g\|_\mu$.

By construction, the Bellman residual of $V_{TD}$ is orthogonal to $\mathcal{F}$, thus for any $1 \le i \le d$
$\langle \mathcal{T}^\pi V_{TD} - V_{TD}, \phi_i \rangle_\mu = 0,$
and
$\langle r^\pi + \gamma P^\pi V_{TD} - V_{TD}, \phi_i \rangle_\mu = 0$
$\langle r^\pi, \phi_i \rangle_\mu + \sum_{j=1}^d \langle \gamma P^\pi \phi_j - \phi_j, \phi_i \rangle_\mu\, \alpha_{TD,j} = 0,$
so $\alpha_{TD}$ is the solution of a linear system of order $d$.

Algorithm. The LSTD solution $\alpha_{TD}$ is obtained by computing the matrix $A$ and the vector $b$ defined as
$A_{i,j} = \langle \phi_i, \phi_j - \gamma P^\pi \phi_j \rangle_\mu, \qquad b_i = \langle \phi_i, r^\pi \rangle_\mu,$
and then solving the system $A\alpha = b$.

Least-squares TD: the approximation error

Problem: in general $\Pi_\mu \mathcal{T}^\pi$ does not admit a fixed point (i.e., the matrix $A$ may not be invertible).
Solution: use the stationary distribution $\mu^\pi$ of policy $\pi$, that is $\mu^\pi P^\pi = \mu^\pi$:
$\mu^\pi(y) = \sum_x p(y|x, \pi(x))\, \mu^\pi(x).$

Proposition. The Bellman operator $\mathcal{T}^\pi$ is a contraction in the weighted $L_{2,\mu^\pi}$-norm. Thus the joint operator $\Pi_{\mu^\pi} \mathcal{T}^\pi$ is a contraction and it admits a unique fixed point $V_{TD}$. Then
$\underbrace{\|V^\pi - V_{TD}\|_{\mu^\pi}}_{\text{approximation error}} \le \frac{1}{\sqrt{1-\gamma^2}} \underbrace{\inf_{V \in \mathcal{F}} \|V^\pi - V\|_{\mu^\pi}}_{\text{smallest approximation error}}.$

Least-squares TD: the implementation

Generate $(X_0, X_1, \ldots)$ by direct execution of $\pi$ and observe $R_t = r(X_t, \pi(X_t))$.
Compute the estimates
$\hat A_{ij} = \frac{1}{n} \sum_{t=1}^n \phi_i(X_t)\big[\phi_j(X_t) - \gamma \phi_j(X_{t+1})\big], \qquad \hat b_i = \frac{1}{n} \sum_{t=1}^n \phi_i(X_t) R_t.$
Solve $\hat A \alpha = \hat b$.
Remark: no need for a generative model. If the chain is ergodic, $\hat A \to A$ and $\hat b \to b$ as $n \to \infty$.
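A sketch of the estimation step (single trajectory, assumed feature map phi):

import numpy as np

def lstd(states, rewards, phi, d, gamma):
    # states: X_0, ..., X_n from running pi; rewards[t] = r(X_t, pi(X_t))
    n = len(states) - 1
    A_hat = np.zeros((d, d))
    b_hat = np.zeros(d)
    for t in range(n):
        f, f_next = phi(states[t]), phi(states[t + 1])
        A_hat += np.outer(f, f - gamma * f_next) / n       # empirical A
        b_hat += f * rewards[t] / n                        # empirical b
    return np.linalg.solve(A_hat, b_hat)                   # alpha_TD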

Bellman Residual Minimization

Bellman Residual Minimization (BRM): the idea

[Figure: $V^\pi$, the space $\mathcal{F}$, and the Bellman-residual minimizer $V_{BR}$.]

Let $\mu$ be a distribution over $X$. $V_{BR}$ is the function in $\mathcal{F}$ with the minimum Bellman residual w.r.t. $\mathcal{T}^\pi$:
$V_{BR} = \arg\min_{V \in \mathcal{F}} \|\mathcal{T}^\pi V - V\|_{2,\mu}.$

The mapping $\alpha \mapsto \mathcal{T}^\pi V_\alpha - V_\alpha$ is affine, so the function $\alpha \mapsto \|\mathcal{T}^\pi V_\alpha - V_\alpha\|_\mu^2$ is quadratic. The minimum is obtained by computing the gradient and setting it to zero:
$\Big\langle r^\pi + (\gamma P^\pi - I) \sum_{j=1}^d \phi_j \alpha_j, \ (\gamma P^\pi - I)\phi_i \Big\rangle_\mu = 0,$
which can be rewritten as $A\alpha = b$ with
$A_{i,j} = \langle \phi_i - \gamma P^\pi \phi_i, \ \phi_j - \gamma P^\pi \phi_j \rangle_\mu, \qquad b_i = \langle \phi_i - \gamma P^\pi \phi_i, \ r^\pi \rangle_\mu.$

Remark: the system admits a solution whenever the features $\phi_i$ are linearly independent w.r.t. $\mu$.
Remark: let $\psi_i = \phi_i - \gamma P^\pi \phi_i$, $i = 1, \ldots, d$; then the previous system can be interpreted as the linear regression problem $\min_\alpha \|\psi^\top \alpha - r^\pi\|_\mu$.

BRM: the approximation error

Proposition. We have
$\|V^\pi - V_{BR}\| \le \|(I - \gamma P^\pi)^{-1}\| \,(1 + \gamma \|P^\pi\|) \inf_{V \in \mathcal{F}} \|V^\pi - V\|.$
If $\mu^\pi$ is the stationary distribution of $\pi$, then $\|P^\pi\|_{\mu^\pi} = 1$ and $\|(I - \gamma P^\pi)^{-1}\|_{\mu^\pi} \le \frac{1}{1-\gamma}$, thus
$\|V^\pi - V_{BR}\|_{\mu^\pi} \le \frac{1+\gamma}{1-\gamma} \inf_{V \in \mathcal{F}} \|V^\pi - V\|_{\mu^\pi}.$

BRM: the implementation

Assumption: a generative model is available.
Draw $n$ states $X_t \sim \mu$.
Call the generative model on $(X_t, A_t)$ (with $A_t = \pi(X_t)$) and obtain $R_t = r(X_t, A_t)$, $Y_t \sim p(\cdot|X_t, A_t)$.
Compute
$\hat B(V) = \frac{1}{n} \sum_{t=1}^n \Big[ V(X_t) - \underbrace{\big( R_t + \gamma V(Y_t) \big)}_{\hat{\mathcal{T}} V(X_t)} \Big]^2.$

Problem: this estimator is biased and not consistent! In fact,
$\mathbb{E}[\hat B(V)] = \mathbb{E}\Big[ \big( V(X_t) - \mathcal{T}^\pi V(X_t) + \mathcal{T}^\pi V(X_t) - \hat{\mathcal{T}} V(X_t) \big)^2 \Big] = \|\mathcal{T}^\pi V - V\|_\mu^2 + \mathbb{E}\Big[ \big( \mathcal{T}^\pi V(X_t) - \hat{\mathcal{T}} V(X_t) \big)^2 \Big],$
so minimizing $\hat B(V)$ does not correspond to minimizing $B(V)$ (even when $n \to \infty$).

Solution: in each state $X_t$, generate two independent samples $Y_t$ and $Y'_t \sim p(\cdot|X_t, A_t)$ and define
$\hat B(V) = \frac{1}{n} \sum_{t=1}^n \big[ V(X_t) - \big( R_t + \gamma V(Y_t) \big) \big]\big[ V(X_t) - \big( R_t + \gamma V(Y'_t) \big) \big].$
Then $\hat B \to B$ as $n \to \infty$.

The function $\alpha \mapsto \hat B(V_\alpha)$ is quadratic and we obtain the linear system $\hat A \alpha = \hat b$ with
$\hat A_{i,j} = \frac{1}{n} \sum_{t=1}^n \big[ \phi_i(X_t) - \gamma \phi_i(Y_t) \big]\big[ \phi_j(X_t) - \gamma \phi_j(Y'_t) \big], \qquad \hat b_i = \frac{1}{n} \sum_{t=1}^n \Big[ \phi_i(X_t) - \gamma \frac{\phi_i(Y_t) + \phi_i(Y'_t)}{2} \Big] R_t.$
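A sketch of the double-sample estimator (states drawn from $\mu$; for each state the generative model is called twice to obtain the independent next states $Y_t$ and $Y'_t$):

import numpy as np

def brm(states, rewards, next1, next2, phi, d, gamma):
    # states[t] = X_t ~ mu; rewards[t] = R_t; next1[t] = Y_t, next2[t] = Y'_t (independent)
    n = len(states)
    A_hat = np.zeros((d, d))
    b_hat = np.zeros(d)
    for t in range(n):
        f = phi(states[t])
        g1, g2 = phi(next1[t]), phi(next2[t])
        A_hat += np.outer(f - gamma * g1, f - gamma * g2) / n
        b_hat += (f - gamma * (g1 + g2) / 2) * rewards[t] / n
    return np.linalg.solve(A_hat, b_hat)                   # alpha_BR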

LSTD vs BRM

Different assumptions: BRM requires a generative model; LSTD only requires a single trajectory.
The performance is evaluated differently: BRM under an arbitrary distribution $\mu$; LSTD under the stationary distribution $\mu^\pi$.


Reinforcement Learning
Alessandro Lazaric
sequel.lille.inria.fr
