Markov Decision Processes and Dynamic Programming


1 Markov Decision Processes and Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) Ecole Centrale - Option DAD SequeL INRIA Lille EC-RL Course

2 In This Lecture A. LAZARIC Markov Decision Processes and Dynamic Programming 2/81

3 In This Lecture How do we formalize the agent-environment interaction? Markov Decision Process (MDP) A. LAZARIC Markov Decision Processes and Dynamic Programming 2/81

4 In This Lecture How do we formalize the agent-environment interaction? Markov Decision Process (MDP) How do we solve an MDP? Dynamic Programming A. LAZARIC Markov Decision Processes and Dynamic Programming 2/81

5 Mathematical Tools Outline Mathematical Tools The Markov Decision Process Bellman Equations for Discounted Infinite Horizon Problems Bellman Equations for Undiscounted Infinite Horizon Problems Dynamic Programming Conclusions A. LAZARIC Markov Decision Processes and Dynamic Programming 3/81

6 Mathematical Tools Probability Theory Definition (Conditional probability) Given two events A and B with P(B) > 0, the conditional probability of A given B is P(A|B) = P(A ∩ B) / P(B). A. LAZARIC Markov Decision Processes and Dynamic Programming 4/81

7 Mathematical Tools Probability Theory Definition (Conditional probability) Given two events A and B with P(B) > 0, the conditional probability of A given B is P(A|B) = P(A ∩ B) / P(B). Similarly, if X and Y are non-degenerate and jointly continuous random variables with density f_{X,Y}(x, y), then if B has positive measure the conditional probability is P(X ∈ A | Y ∈ B) = ( ∫_{y∈B} ∫_{x∈A} f_{X,Y}(x, y) dx dy ) / ( ∫_{y∈B} ∫_x f_{X,Y}(x, y) dx dy ). A. LAZARIC Markov Decision Processes and Dynamic Programming 4/81

8 Mathematical Tools Probability Theory Definition (Law of total expectation) Given a function f and two random variables X, Y we have that E_{X,Y}[ f(X, Y) ] = E_X[ E_Y[ f(x, Y) | X = x ] ]. A. LAZARIC Markov Decision Processes and Dynamic Programming 5/81

9 Mathematical Tools Norms and Contractions Definition Given a vector space V ⊆ R^d, a function f : V → R^+_0 is a norm if and only if: If f(v) = 0 for some v ∈ V, then v = 0. For any λ ∈ R, v ∈ V, f(λv) = |λ| f(v). Triangle inequality: for any v, u ∈ V, f(v + u) ≤ f(v) + f(u). A. LAZARIC Markov Decision Processes and Dynamic Programming 6/81

10 Mathematical Tools Norms and Contractions L_p-norm: ‖v‖_p = ( Σ_{i=1}^d |v_i|^p )^{1/p}. A. LAZARIC Markov Decision Processes and Dynamic Programming 7/81

11 Mathematical Tools Norms and Contractions L_p-norm: ‖v‖_p = ( Σ_{i=1}^d |v_i|^p )^{1/p}. L_∞-norm: ‖v‖_∞ = max_{1≤i≤d} |v_i|. A. LAZARIC Markov Decision Processes and Dynamic Programming 7/81

12 Mathematical Tools Norms and Contractions L_p-norm: ‖v‖_p = ( Σ_{i=1}^d |v_i|^p )^{1/p}. L_∞-norm: ‖v‖_∞ = max_{1≤i≤d} |v_i|. L_{µ,p}-norm: ‖v‖_{µ,p} = ( Σ_{i=1}^d |v_i|^p / µ_i )^{1/p}. A. LAZARIC Markov Decision Processes and Dynamic Programming 7/81

13 Mathematical Tools Norms and Contractions L_p-norm: ‖v‖_p = ( Σ_{i=1}^d |v_i|^p )^{1/p}. L_∞-norm: ‖v‖_∞ = max_{1≤i≤d} |v_i|. L_{µ,p}-norm: ‖v‖_{µ,p} = ( Σ_{i=1}^d |v_i|^p / µ_i )^{1/p}. L_{µ,∞}-norm: ‖v‖_{µ,∞} = max_{1≤i≤d} |v_i| / µ_i. A. LAZARIC Markov Decision Processes and Dynamic Programming 7/81

14 Mathematical Tools Norms and Contractions L_p-norm: ‖v‖_p = ( Σ_{i=1}^d |v_i|^p )^{1/p}. L_∞-norm: ‖v‖_∞ = max_{1≤i≤d} |v_i|. L_{µ,p}-norm: ‖v‖_{µ,p} = ( Σ_{i=1}^d |v_i|^p / µ_i )^{1/p}. L_{µ,∞}-norm: ‖v‖_{µ,∞} = max_{1≤i≤d} |v_i| / µ_i. L_{2,P}-matrix norm (P is a positive definite matrix): ‖v‖²_P = vᵀ P v. A. LAZARIC Markov Decision Processes and Dynamic Programming 7/81
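
As a quick numerical companion (not part of the slides), the following Python/numpy sketch evaluates the norms above on an arbitrary vector; the weight vector mu and the positive definite matrix P are illustrative choices.

```python
import numpy as np

v = np.array([1.0, -3.0, 2.0])
mu = np.array([1.0, 2.0, 4.0])   # positive weights (illustrative values)

p = 2
lp_norm = (np.abs(v) ** p).sum() ** (1 / p)            # L_p norm
linf_norm = np.abs(v).max()                            # L_inf norm
lmu_p_norm = ((np.abs(v) ** p) / mu).sum() ** (1 / p)  # weighted L_{mu,p} norm
lmu_inf_norm = (np.abs(v) / mu).max()                  # weighted L_{mu,inf} norm

P = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 3.0]])                        # positive definite matrix
l2P_squared = v @ P @ v                                # ||v||_P^2 = v^T P v

print(lp_norm, linf_norm, lmu_p_norm, lmu_inf_norm, l2P_squared)
```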

15 Mathematical Tools Norms and Contractions Definition A sequence of vectors v_n ∈ V (with n ∈ N) is said to converge in norm to v ∈ V if lim_{n→∞} ‖v_n − v‖ = 0. A. LAZARIC Markov Decision Processes and Dynamic Programming 8/81

16 Mathematical Tools Norms and Contractions Definition A sequence of vectors v_n ∈ V (with n ∈ N) is said to converge in norm to v ∈ V if lim_{n→∞} ‖v_n − v‖ = 0. Definition A sequence of vectors v_n ∈ V (with n ∈ N) is a Cauchy sequence if lim_{n→∞} sup_{m≥n} ‖v_n − v_m‖ = 0. A. LAZARIC Markov Decision Processes and Dynamic Programming 8/81

17 Mathematical Tools Norms and Contractions Definition A sequence of vectors v_n ∈ V (with n ∈ N) is said to converge in norm to v ∈ V if lim_{n→∞} ‖v_n − v‖ = 0. Definition A sequence of vectors v_n ∈ V (with n ∈ N) is a Cauchy sequence if lim_{n→∞} sup_{m≥n} ‖v_n − v_m‖ = 0. Definition A vector space V equipped with a norm ‖·‖ is complete if every Cauchy sequence in V is convergent in the norm of the space. A. LAZARIC Markov Decision Processes and Dynamic Programming 8/81

18 Mathematical Tools Norms and Contractions Definition An operator T : V → V is L-Lipschitz if for any v, u ∈ V, ‖T v − T u‖ ≤ L ‖u − v‖. If L ≤ 1 then T is a non-expansion, while if L < 1 then T is an L-contraction. If T is Lipschitz then it is also continuous, that is if v_n → v then T v_n → T v. A. LAZARIC Markov Decision Processes and Dynamic Programming 9/81

19 Mathematical Tools Norms and Contractions Definition An operator T : V → V is L-Lipschitz if for any v, u ∈ V, ‖T v − T u‖ ≤ L ‖u − v‖. If L ≤ 1 then T is a non-expansion, while if L < 1 then T is an L-contraction. If T is Lipschitz then it is also continuous, that is if v_n → v then T v_n → T v. Definition A vector v ∈ V is a fixed point of the operator T : V → V if T v = v. A. LAZARIC Markov Decision Processes and Dynamic Programming 9/81

20 Mathematical Tools Norms and Contractions Proposition (Banach Fixed Point Theorem) Let V be a complete vector space equipped with the norm ‖·‖ and T : V → V be a γ-contraction mapping. Then 1. T admits a unique fixed point v*. 2. For any v_0 ∈ V, if v_{n+1} = T v_n then v_n → v* with a geometric convergence rate: ‖v_n − v*‖ ≤ γ^n ‖v_0 − v*‖. A. LAZARIC Markov Decision Processes and Dynamic Programming 10/81
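
A minimal numerical illustration of the theorem (a sketch, not from the slides): an affine map v ↦ γPv + b with P a stochastic matrix is a γ-contraction in the sup-norm, so iterating it converges geometrically to its unique fixed point.

```python
import numpy as np

gamma = 0.9
rng = np.random.default_rng(0)

# T(v) = gamma * P @ v + b is a gamma-contraction in the sup-norm
# whenever P has non-negative rows summing to one (a stochastic matrix).
P = rng.random((5, 5)); P /= P.sum(axis=1, keepdims=True)
b = rng.random(5)

def T(v):
    return gamma * P @ v + b

# Unique fixed point, here available in closed form: v* = (I - gamma P)^{-1} b.
v_star = np.linalg.solve(np.eye(5) - gamma * P, b)

v = np.zeros(5)
for n in range(1, 21):
    v = T(v)
    # geometric rate: ||v_n - v*||_inf <= gamma^n ||v_0 - v*||_inf (v_0 = 0 here)
    assert np.abs(v - v_star).max() <= gamma ** n * np.abs(v_star).max() + 1e-12
```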

21 Mathematical Tools Linear Algebra Given a square matrix A ∈ R^{N×N}: Eigenvalues of a matrix (1). v ∈ R^N and λ ∈ R are an eigenvector and eigenvalue of A if Av = λv. A. LAZARIC Markov Decision Processes and Dynamic Programming 11/81

22 Mathematical Tools Linear Algebra Given a square matrix A ∈ R^{N×N}: Eigenvalues of a matrix (1). v ∈ R^N and λ ∈ R are an eigenvector and eigenvalue of A if Av = λv. Eigenvalues of a matrix (2). If A has eigenvalues {λ_i}_{i=1}^N, then B = (I − αA) has eigenvalues {µ_i} with µ_i = 1 − αλ_i. A. LAZARIC Markov Decision Processes and Dynamic Programming 11/81

23 Mathematical Tools Linear Algebra Given a square matrix A ∈ R^{N×N}: Eigenvalues of a matrix (1). v ∈ R^N and λ ∈ R are an eigenvector and eigenvalue of A if Av = λv. Eigenvalues of a matrix (2). If A has eigenvalues {λ_i}_{i=1}^N, then B = (I − αA) has eigenvalues {µ_i} with µ_i = 1 − αλ_i. Matrix inversion. A can be inverted if and only if ∀i, λ_i ≠ 0. A. LAZARIC Markov Decision Processes and Dynamic Programming 11/81

24 Mathematical Tools Linear Algebra Stochastic matrix. A square matrix P ∈ R^{N×N} is a stochastic matrix if 1. all entries are non-negative, ∀i, j, [P]_{i,j} ≥ 0, 2. all the rows sum to one, ∀i, Σ_{j=1}^N [P]_{i,j} = 1. All the eigenvalues of a stochastic matrix are bounded by 1, i.e., ∀i, |λ_i| ≤ 1. A. LAZARIC Markov Decision Processes and Dynamic Programming 12/81

25 The Markov Decision Process Outline Mathematical Tools The Markov Decision Process Bellman Equations for Discounted Infinite Horizon Problems Bellman Equations for Undiscounted Infinite Horizon Problems Dynamic Programming Conclusions A. LAZARIC Markov Decision Processes and Dynamic Programming 13/81

26 The Markov Decision Process The Reinforcement Learning Model [Diagram: the learning agent sends actions/actuations to the environment, which returns states/perceptions, while a critic returns rewards.] A. LAZARIC Markov Decision Processes and Dynamic Programming 14/81

27 The Markov Decision Process Markov Chains Definition (Markov chain) Let the state space X be a bounded compact subset of the Euclidean space; the discrete-time dynamic system (x_t)_{t∈N} ∈ X is a Markov chain if it satisfies the Markov property P(x_{t+1} = x | x_t, x_{t−1}, ..., x_0) = P(x_{t+1} = x | x_t). Given an initial state x_0 ∈ X, a Markov chain is defined by the transition probability p(y|x) = P(x_{t+1} = y | x_t = x). A. LAZARIC Markov Decision Processes and Dynamic Programming 15/81
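
To make the definition concrete, here is a small simulation sketch (the transition matrix values are made up for illustration) that samples a trajectory using only the current state, exactly as the Markov property prescribes.

```python
import numpy as np

# Transition matrix of a small Markov chain: p[x, y] = P(x_{t+1} = y | x_t = x).
p = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.3, 0.7]])

rng = np.random.default_rng(0)

def simulate(p, x0, T):
    """Sample x_0, ..., x_T; the next state depends on the current state alone."""
    traj = [x0]
    for _ in range(T):
        traj.append(rng.choice(len(p), p=p[traj[-1]]))
    return traj

print(simulate(p, x0=0, T=10))
```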

28 The Markov Decision Process Example: Weather prediction Informal definition: we want to describe how the weather evolves over time. Board! A. LAZARIC Markov Decision Processes and Dynamic Programming 16/81

29 The Markov Decision Process Markov Decision Process Definition (Markov decision process [1, 4, 3, 5, 2]) A Markov decision process is defined as a tuple M = (X, A, p, r) where A. LAZARIC Markov Decision Processes and Dynamic Programming 17/81

30 The Markov Decision Process Markov Decision Process Definition (Markov decision process [1, 4, 3, 5, 2]) A Markov decision process is defined as a tuple M = (X, A, p, r) where t is the time clock, A. LAZARIC Markov Decision Processes and Dynamic Programming 17/81

31 The Markov Decision Process Markov Decision Process Definition (Markov decision process [1, 4, 3, 5, 2]) A Markov decision process is defined as a tuple M = (X, A, p, r) where t is the time clock, X is the state space, A. LAZARIC Markov Decision Processes and Dynamic Programming 17/81

32 The Markov Decision Process Markov Decision Process Definition (Markov decision process [1, 4, 3, 5, 2]) A Markov decision process is defined as a tuple M = (X, A, p, r) where t is the time clock, X is the state space, A is the action space, A. LAZARIC Markov Decision Processes and Dynamic Programming 17/81

33 The Markov Decision Process Markov Decision Process Definition (Markov decision process [1, 4, 3, 5, 2]) A Markov decision process is defined as a tuple M = (X, A, p, r) where t is the time clock, X is the state space, A is the action space, p(y|x, a) is the transition probability with p(y|x, a) = P(x_{t+1} = y | x_t = x, a_t = a), A. LAZARIC Markov Decision Processes and Dynamic Programming 17/81

34 The Markov Decision Process Markov Decision Process Definition (Markov decision process [1, 4, 3, 5, 2]) A Markov decision process is defined as a tuple M = (X, A, p, r) where t is the time clock, X is the state space, A is the action space, p(y|x, a) is the transition probability with p(y|x, a) = P(x_{t+1} = y | x_t = x, a_t = a), r(x, a, y) is the reward of transition (x, a, y). A. LAZARIC Markov Decision Processes and Dynamic Programming 17/81

35 The Markov Decision Process Examples Park a car Find the shortest path from home to school Schedule a fleet of trucks A. LAZARIC Markov Decision Processes and Dynamic Programming 18/81

36 The Markov Decision Process Policy Definition (Policy) A decision rule π_t can be Deterministic: π_t : X → A, Stochastic: π_t : X → ∆(A), A. LAZARIC Markov Decision Processes and Dynamic Programming 19/81

37 The Markov Decision Process Policy Definition (Policy) A decision rule π_t can be Deterministic: π_t : X → A, Stochastic: π_t : X → ∆(A), A policy (strategy, plan) can be Non-stationary: π = (π_0, π_1, π_2, ...), Stationary (Markovian): π = (π, π, π, ...). A. LAZARIC Markov Decision Processes and Dynamic Programming 19/81

38 The Markov Decision Process Policy Definition (Policy) A decision rule π_t can be Deterministic: π_t : X → A, Stochastic: π_t : X → ∆(A), A policy (strategy, plan) can be Non-stationary: π = (π_0, π_1, π_2, ...), Stationary (Markovian): π = (π, π, π, ...). Remark: MDP M + stationary policy π ⇒ Markov chain with state space X and transition probability p(y|x) = p(y|x, π(x)). A. LAZARIC Markov Decision Processes and Dynamic Programming 19/81
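
The remark can be checked mechanically. In the sketch below (array shapes and the random MDP are placeholders, not from the slides) a deterministic stationary policy turns the transition tensor p(y|x, a) into the transition matrix of the induced Markov chain.

```python
import numpy as np

# Generic finite MDP: P[x, a, y] = p(y | x, a); pi maps each state to an action.
N, A = 4, 2
rng = np.random.default_rng(0)
P = rng.random((N, A, N)); P /= P.sum(axis=2, keepdims=True)
pi = np.array([0, 1, 1, 0])          # a deterministic stationary policy

# Induced Markov chain: p^pi(y | x) = p(y | x, pi(x)).
P_pi = P[np.arange(N), pi, :]        # shape (N, N)
assert np.allclose(P_pi.sum(axis=1), 1.0)   # each row is a probability distribution
```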

39 The Markov Decision Process Question Is the MDP formalism powerful enough? Let's try! A. LAZARIC Markov Decision Processes and Dynamic Programming 20/81

40 The Markov Decision Process Example: the Retail Store Management Problem Description. At each month t, a store contains x_t items of a specific good and the demand for that good is D_t. At the end of each month the manager of the store can order a_t more items from his supplier. Furthermore we know that The cost of maintaining an inventory of x is h(x). The cost to order a items is C(a). The income for selling q items is f(q). If the demand D is bigger than the available inventory x, customers that cannot be served leave. The value of the remaining inventory at the end of the year is g(x). Constraint: the store has a maximum capacity M. A. LAZARIC Markov Decision Processes and Dynamic Programming 21/81

41 The Markov Decision Process Example: the Retail Store Management Problem State space: x ∈ X = {0, 1, ..., M}. A. LAZARIC Markov Decision Processes and Dynamic Programming 22/81

42 The Markov Decision Process Example: the Retail Store Management Problem State space: x ∈ X = {0, 1, ..., M}. Action space: it is not possible to order more items than the capacity of the store, so the action space should depend on the current state. Formally, at state x, a ∈ A(x) = {0, 1, ..., M − x}. A. LAZARIC Markov Decision Processes and Dynamic Programming 22/81

43 The Markov Decision Process Example: the Retail Store Management Problem State space: x ∈ X = {0, 1, ..., M}. Action space: it is not possible to order more items than the capacity of the store, so the action space should depend on the current state. Formally, at state x, a ∈ A(x) = {0, 1, ..., M − x}. Dynamics: x_{t+1} = [x_t + a_t − D_t]^+. Problem: the dynamics should be Markov and stationary! A. LAZARIC Markov Decision Processes and Dynamic Programming 22/81

44 The Markov Decision Process Example: the Retail Store Management Problem State space: x ∈ X = {0, 1, ..., M}. Action space: it is not possible to order more items than the capacity of the store, so the action space should depend on the current state. Formally, at state x, a ∈ A(x) = {0, 1, ..., M − x}. Dynamics: x_{t+1} = [x_t + a_t − D_t]^+. Problem: the dynamics should be Markov and stationary! The demand D_t is stochastic and time-independent. Formally, D_t ∼ D i.i.d. A. LAZARIC Markov Decision Processes and Dynamic Programming 22/81

45 The Markov Decision Process Example: the Retail Store Management Problem State space: x ∈ X = {0, 1, ..., M}. Action space: it is not possible to order more items than the capacity of the store, so the action space should depend on the current state. Formally, at state x, a ∈ A(x) = {0, 1, ..., M − x}. Dynamics: x_{t+1} = [x_t + a_t − D_t]^+. Problem: the dynamics should be Markov and stationary! The demand D_t is stochastic and time-independent. Formally, D_t ∼ D i.i.d. Reward: r_t = −C(a_t) − h(x_t + a_t) + f([x_t + a_t − x_{t+1}]^+). A. LAZARIC Markov Decision Processes and Dynamic Programming 22/81
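
Putting the pieces together, here is a rough sketch of how the transition probabilities and expected rewards of this MDP could be tabulated. The demand distribution (uniform on {0,...,9}) and the cost/income functions C, h, f below are illustrative assumptions, since the slides leave them unspecified; the end-of-year value g is ignored.

```python
import numpy as np

M = 20                                       # store capacity
d_vals = np.arange(0, 10)
pD = np.full(len(d_vals), 1 / len(d_vals))   # D ~ Uniform{0,...,9}, i.i.d. over months
C = lambda a: 2.0 * a                        # ordering cost (assumed)
h = lambda x: 0.5 * x                        # inventory holding cost (assumed)
f = lambda q: 4.0 * q                        # income for selling q items (assumed)

def transition_and_reward(x, a):
    """Return p(. | x, a) over the states {0,...,M} and the expected reward r(x, a)."""
    assert 0 <= a <= M - x                   # A(x) = {0,...,M-x}: respect the capacity
    stock = x + a
    sold = np.minimum(d_vals, stock)         # items actually sold when the demand is d
    nxt = stock - sold                       # x_{t+1} = [x + a - D]^+
    p = np.zeros(M + 1)
    np.add.at(p, nxt, pD)                    # aggregate demand outcomes by next state
    r = -C(a) - h(stock) + (f(sold) * pD).sum()
    return p, r

p, r = transition_and_reward(x=3, a=4)
assert np.isclose(p.sum(), 1.0)
```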

46 The Markov Decision Process Exercise: the Parking Problem A driver wants to park his car as close as possible to the restaurant. [Diagram: parking places 1, 2, ..., t, ..., T in front of the restaurant, each available with probability p(t); parking closer to the restaurant yields a higher reward, and not parking yields reward 0.] A. LAZARIC Markov Decision Processes and Dynamic Programming 23/81

47 The Markov Decision Process Exercise: the Parking Problem A driver wants to park his car as close as possible to the restaurant. [Diagram: parking places 1, 2, ..., t, ..., T in front of the restaurant, each available with probability p(t); parking closer to the restaurant yields a higher reward, and not parking yields reward 0.] The driver cannot see whether a place is available unless he is in front of it. There are P places. At each place i the driver can either move to the next place or park (if the place is available). The closer to the restaurant the parking, the higher the satisfaction. If the driver doesn't park anywhere, then he/she leaves the restaurant and has to find another one. A. LAZARIC Markov Decision Processes and Dynamic Programming 23/81

48 The Markov Decision Process Question How do we evaluate a policy and compare two policies? Value function! A. LAZARIC Markov Decision Processes and Dynamic Programming 24/81

49 The Markov Decision Process Optimization over Time Horizon Finite time horizon T : deadline at time T, the agent focuses on the sum of the rewards up to T. A. LAZARIC Markov Decision Processes and Dynamic Programming 25/81

50 The Markov Decision Process Optimization over Time Horizon Finite time horizon T : deadline at time T, the agent focuses on the sum of the rewards up to T. Infinite time horizon with discount: the problem never terminates but rewards which are closer in time receive a higher importance. A. LAZARIC Markov Decision Processes and Dynamic Programming 25/81

51 The Markov Decision Process Optimization over Time Horizon Finite time horizon T : deadline at time T, the agent focuses on the sum of the rewards up to T. Infinite time horizon with discount: the problem never terminates but rewards which are closer in time receive a higher importance. Infinite time horizon with terminal state: the problem never terminates but the agent will eventually reach a termination state. A. LAZARIC Markov Decision Processes and Dynamic Programming 25/81

52 The Markov Decision Process Optimization over Time Horizon Finite time horizon T : deadline at time T, the agent focuses on the sum of the rewards up to T. Infinite time horizon with discount: the problem never terminates but rewards which are closer in time receive a higher importance. Infinite time horizon with terminal state: the problem never terminates but the agent will eventually reach a termination state. Infinite time horizon with average reward: the problem never terminates but the agent only focuses on the (expected) average of the rewards. A. LAZARIC Markov Decision Processes and Dynamic Programming 25/81

53 The Markov Decision Process State Value Function Finite time horizon T: deadline at time T, the agent focuses on the sum of the rewards up to T. V^π(t, x) = E[ Σ_{s=t}^{T−1} r(x_s, π_s(x_s)) + R(x_T) | x_t = x; π ], where R is a value function for the final state. A. LAZARIC Markov Decision Processes and Dynamic Programming 26/81

54 The Markov Decision Process State Value Function Infinite time horizon with discount: the problem never terminates but rewards which are closer in time receive a higher importance. V^π(x) = E[ Σ_{t=0}^∞ γ^t r(x_t, π(x_t)) | x_0 = x; π ], with discount factor 0 ≤ γ < 1: small γ = short-term rewards, big γ = long-term rewards; for any γ ∈ [0, 1) the series always converges (for bounded rewards). A. LAZARIC Markov Decision Processes and Dynamic Programming 27/81

55 The Markov Decision Process State Value Function Infinite time horizon with terminal state: the problem never terminates but the agent will eventually reach a termination state. V^π(x) = E[ Σ_{t=0}^T r(x_t, π(x_t)) | x_0 = x; π ], where T is the first (random) time when the termination state is achieved. A. LAZARIC Markov Decision Processes and Dynamic Programming 28/81

56 The Markov Decision Process State Value Function Infinite time horizon with average reward: the problem never terminates but the agent only focuses on the (expected) average of the rewards. V^π(x) = lim_{T→∞} E[ (1/T) Σ_{t=0}^{T−1} r(x_t, π(x_t)) | x_0 = x; π ]. A. LAZARIC Markov Decision Processes and Dynamic Programming 29/81

57 The Markov Decision Process State Value Function Technical note: the expectations refer to all possible stochastic trajectories. A. LAZARIC Markov Decision Processes and Dynamic Programming 30/81

58 The Markov Decision Process State Value Function Technical note: the expectations refer to all possible stochastic trajectories. A non-stationary policy π applied from state x_0 returns (x_0, r_0, x_1, r_1, x_2, r_2, ...) with r_t = r(x_t, π_t(x_t)) and x_{t+1} ∼ p(·|x_t, a_t = π_t(x_t)) random realizations. The value function (discounted infinite horizon) is V^π(x) = E_{(x_1, x_2, ...)}[ Σ_{t=0}^∞ γ^t r(x_t, π_t(x_t)) | x_0 = x; π ]. A. LAZARIC Markov Decision Processes and Dynamic Programming 30/81

59 The Markov Decision Process Optimal Value Function Definition (Optimal policy and optimal value function) The solution to an MDP is an optimal policy π* satisfying π* ∈ arg max_{π∈Π} V^π in all the states x ∈ X, where Π is some policy set of interest. A. LAZARIC Markov Decision Processes and Dynamic Programming 31/81

60 The Markov Decision Process Optimal Value Function Definition (Optimal policy and optimal value function) The solution to an MDP is an optimal policy π* satisfying π* ∈ arg max_{π∈Π} V^π in all the states x ∈ X, where Π is some policy set of interest. The corresponding value function is the optimal value function V* = V^{π*}. A. LAZARIC Markov Decision Processes and Dynamic Programming 31/81

61 The Markov Decision Process Optimal Value Function Definition (Optimal policy and optimal value function) The solution to an MDP is an optimal policy π* satisfying π* ∈ arg max_{π∈Π} V^π in all the states x ∈ X, where Π is some policy set of interest. The corresponding value function is the optimal value function V* = V^{π*}. Remark: π* ∈ arg max(·) and not π* = arg max(·) because an MDP may admit more than one optimal policy. A. LAZARIC Markov Decision Processes and Dynamic Programming 31/81

62 The Markov Decision Process Example: the EC student dilemma [Diagram: the EC student dilemma modeled as an MDP with seven states x_1, ..., x_7, Rest/Work actions, and transition probabilities (0.4, 0.5, 0.6, ...) and rewards attached to the transitions.] A. LAZARIC Markov Decision Processes and Dynamic Programming 32/81

63 The Markov Decision Process Example: the EC student dilemma Model: all the transitions are Markov; states x_5, x_6, x_7 are terminal. Setting: infinite horizon with terminal states. Objective: find the policy that maximizes the expected sum of rewards before achieving a terminal state. A. LAZARIC Markov Decision Processes and Dynamic Programming 33/81

64 The Markov Decision Process Example: the EC student dilemma [Diagram: the student dilemma MDP annotated with the value of each state.] A. LAZARIC Markov Decision Processes and Dynamic Programming 34/81

65 The Markov Decision Process Example: the EC student dilemma The terminal states give V_7 = −1000, V_6 = 100, V_5 = −10; the values V_4, V_3, V_2 are obtained by solving the corresponding Bellman equations backward, and finally V_1 = max{0.5 V_2 + 0.5 V_1, 0.5 V_3 + 0.5 V_1}, which gives V_1 = V_2 = 88.3. A. LAZARIC Markov Decision Processes and Dynamic Programming 35/81

66 The Markov Decision Process State-Action Value Function Definition In discounted infinite horizon problems, for any policy π, the state-action value function (or Q-function) Q^π : X × A → R is Q^π(x, a) = E[ Σ_{t≥0} γ^t r(x_t, a_t) | x_0 = x, a_0 = a, a_t = π(x_t), ∀t ≥ 1 ], and the corresponding optimal Q-function is Q*(x, a) = max_π Q^π(x, a). A. LAZARIC Markov Decision Processes and Dynamic Programming 36/81

67 The Markov Decision Process State-Action Value Function The relationships between the V-function and the Q-function are: Q^π(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) V^π(y), V^π(x) = Q^π(x, π(x)), Q*(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) V*(y), V*(x) = Q*(x, π*(x)) = max_{a∈A} Q*(x, a). A. LAZARIC Markov Decision Processes and Dynamic Programming 37/81
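
These relations translate directly into array operations. The sketch below (random placeholder MDP with P of shape (N, A, N) and r of shape (N, A)) computes Q from V and extracts a greedy action in each state.

```python
import numpy as np

def q_from_v(P, r, V, gamma):
    """Q(x, a) = r(x, a) + gamma * sum_y p(y|x,a) V(y)."""
    return r + gamma * P @ V              # P @ V contracts the last axis of P with V

N, A, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((N, A, N)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((N, A))
V = rng.random(N)

Q = q_from_v(P, r, V, gamma)
greedy_actions = Q.argmax(axis=1)         # for V = V*, this recovers an optimal policy
```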

68 Bellman Equations for Discounted Infinite Horizon Problems Outline Mathematical Tools The Markov Decision Process Bellman Equations for Discounted Infinite Horizon Problems Bellman Equations for Undiscounted Infinite Horizon Problems Dynamic Programming Conclusions A. LAZARIC Markov Decision Processes and Dynamic Programming 38/81

69 Bellman Equations for Discounted Infinite Horizon Problems Question Is there any more compact way to describe a value function? Bellman equations! A. LAZARIC Markov Decision Processes and Dynamic Programming 39/81

70 Bellman Equations for Discounted Infinite Horizon Problems The Bellman Equation Proposition For any stationary policy π = (π, π, ...), the state value function at a state x ∈ X satisfies the Bellman equation: V^π(x) = r(x, π(x)) + γ Σ_y p(y|x, π(x)) V^π(y). A. LAZARIC Markov Decision Processes and Dynamic Programming 40/81

71 Bellman Equations for Discounted Infinite Horizon Problems The Bellman Equation Proof. For any policy π, V^π(x) = E[ Σ_{t≥0} γ^t r(x_t, π(x_t)) | x_0 = x; π ] = r(x, π(x)) + E[ Σ_{t≥1} γ^t r(x_t, π(x_t)) | x_0 = x; π ] = r(x, π(x)) + γ Σ_y P(x_1 = y | x_0 = x; π(x_0)) E[ Σ_{t≥1} γ^{t−1} r(x_t, π(x_t)) | x_1 = y; π ] = r(x, π(x)) + γ Σ_y p(y|x, π(x)) V^π(y). A. LAZARIC Markov Decision Processes and Dynamic Programming 41/81

72 Bellman Equations for Discounted Infinite Horizon Problems The Optimal Bellman Equation Bellman s Principle of Optimality [1]: An optimal policy has the property that, whatever the initial state and the initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. A. LAZARIC Markov Decision Processes and Dynamic Programming 42/81

73 Bellman Equations for Discounted Infinite Horizon Problems The Optimal Bellman Equation Proposition The optimal value function V* (i.e., V* = max_π V^π) is the solution to the optimal Bellman equation: V*(x) = max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ], and the optimal policy is π*(x) = arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ]. A. LAZARIC Markov Decision Processes and Dynamic Programming 43/81

74 Bellman Equations for Discounted Infinite Horizon Problems The Optimal Bellman Equation Proof. For any policy π = (a, π′) (possibly non-stationary), V*(x) (a)= max_π E[ Σ_{t≥0} γ^t r(x_t, π(x_t)) | x_0 = x; π ] (b)= max_{(a,π′)} [ r(x, a) + γ Σ_y p(y|x, a) V^{π′}(y) ] (c)= max_a [ r(x, a) + γ Σ_y p(y|x, a) max_{π′} V^{π′}(y) ] (d)= max_a [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ]. A. LAZARIC Markov Decision Processes and Dynamic Programming 44/81

75 Bellman Equations for Discounted Infinite Horizon Problems The Bellman Operators Notation. W.l.o.g. we consider a discrete state space with |X| = N, so that V^π ∈ R^N. Definition For any W ∈ R^N, the Bellman operator T^π : R^N → R^N is T^π W(x) = r(x, π(x)) + γ Σ_y p(y|x, π(x)) W(y), and the optimal Bellman operator (or dynamic programming operator) is T W(x) = max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) W(y) ]. A. LAZARIC Markov Decision Processes and Dynamic Programming 45/81
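
A possible vectorized implementation of the two operators (a sketch; the random MDP below is a placeholder, not the slides' example):

```python
import numpy as np

def T_pi(W, P, r, pi, gamma):
    """(T^pi W)(x) = r(x, pi(x)) + gamma * sum_y p(y | x, pi(x)) W(y)."""
    N = len(W)
    return r[np.arange(N), pi] + gamma * P[np.arange(N), pi, :] @ W

def T_opt(W, P, r, gamma):
    """(T W)(x) = max_a [ r(x, a) + gamma * sum_y p(y | x, a) W(y) ]."""
    return (r + gamma * P @ W).max(axis=1)

N, A, gamma = 5, 3, 0.95
rng = np.random.default_rng(0)
P = rng.random((N, A, N)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((N, A))
pi = rng.integers(0, A, size=N)
W = np.zeros(N)

print(T_pi(W, P, r, pi, gamma), T_opt(W, P, r, gamma))
```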

76 Bellman Equations for Discounted Infinite Horizon Problems The Bellman Operators Proposition Properties of the Bellman operators 1. Monotonicity: for any W_1, W_2 ∈ R^N, if W_1 ≤ W_2 component-wise, then T^π W_1 ≤ T^π W_2, T W_1 ≤ T W_2. A. LAZARIC Markov Decision Processes and Dynamic Programming 46/81

77 Bellman Equations for Discounted Infinite Horizon Problems The Bellman Operators Proposition Properties of the Bellman operators 1. Monotonicity: for any W_1, W_2 ∈ R^N, if W_1 ≤ W_2 component-wise, then T^π W_1 ≤ T^π W_2, T W_1 ≤ T W_2. 2. Offset: for any scalar c ∈ R, T^π(W + c I_N) = T^π W + γ c I_N, T(W + c I_N) = T W + γ c I_N. A. LAZARIC Markov Decision Processes and Dynamic Programming 46/81

78 Bellman Equations for Discounted Infinite Horizon Problems The Bellman Operators Proposition 3. Contraction in L_∞-norm: for any W_1, W_2 ∈ R^N, ‖T^π W_1 − T^π W_2‖_∞ ≤ γ ‖W_1 − W_2‖_∞, ‖T W_1 − T W_2‖_∞ ≤ γ ‖W_1 − W_2‖_∞. A. LAZARIC Markov Decision Processes and Dynamic Programming 47/81

79 Bellman Equations for Discounted Infinite Horizon Problems The Bellman Operators Proposition 3. Contraction in L_∞-norm: for any W_1, W_2 ∈ R^N, ‖T^π W_1 − T^π W_2‖_∞ ≤ γ ‖W_1 − W_2‖_∞, ‖T W_1 − T W_2‖_∞ ≤ γ ‖W_1 − W_2‖_∞. 4. Fixed point: for any policy π, V^π is the unique fixed point of T^π, and V* is the unique fixed point of T. A. LAZARIC Markov Decision Processes and Dynamic Programming 47/81

80 Bellman Equations for Discounted Infinite Horizon Problems The Bellman Operators Proposition 3. Contraction in L_∞-norm: for any W_1, W_2 ∈ R^N, ‖T^π W_1 − T^π W_2‖_∞ ≤ γ ‖W_1 − W_2‖_∞, ‖T W_1 − T W_2‖_∞ ≤ γ ‖W_1 − W_2‖_∞. 4. Fixed point: for any policy π, V^π is the unique fixed point of T^π, and V* is the unique fixed point of T. Furthermore, for any W ∈ R^N and any stationary policy π, lim_{k→∞} (T^π)^k W = V^π and lim_{k→∞} (T)^k W = V*. A. LAZARIC Markov Decision Processes and Dynamic Programming 47/81

81 Bellman Equations for Discounted Infinite Horizon Problems The Bellman Equation Proof. The contraction property (3) holds since for any x ∈ X we have T W_1(x) − T W_2(x) = max_a [ r(x, a) + γ Σ_y p(y|x, a) W_1(y) ] − max_{a′} [ r(x, a′) + γ Σ_y p(y|x, a′) W_2(y) ] (a)≤ max_a [ r(x, a) + γ Σ_y p(y|x, a) W_1(y) − r(x, a) − γ Σ_y p(y|x, a) W_2(y) ] = γ max_a Σ_y p(y|x, a) |W_1(y) − W_2(y)| ≤ γ ‖W_1 − W_2‖_∞ max_a Σ_y p(y|x, a) = γ ‖W_1 − W_2‖_∞, where in (a) we used max_a f(a) − max_{a′} g(a′) ≤ max_a (f(a) − g(a)). A. LAZARIC Markov Decision Processes and Dynamic Programming 48/81

82 Bellman Equations for Discounted Infinite Horizon Problems Exercise: Fixed Point Revise the Banach fixed point theorem and prove the fixed point property of the Bellman operator. A. LAZARIC Markov Decision Processes and Dynamic Programming 49/81

83 Bellman Equations for Undiscounted Infinite Horizon Problems Outline Mathematical Tools The Markov Decision Process Bellman Equations for Discounted Infinite Horizon Problems Bellman Equations for Undiscounted Infinite Horizon Problems Dynamic Programming Conclusions A. LAZARIC Markov Decision Processes and Dynamic Programming 50/81

84 Bellman Equations for Undiscounted Infinite Horizon Problems Question Is there any more compact way to describe a value function when we consider an infinite horizon with no discount? Proper policies and Bellman equations! A. LAZARIC Markov Decision Processes and Dynamic Programming 51/81

85 Bellman Equations for Undiscounted Infinite Horizon Problems The Undiscounted Infinite Horizon Setting The value function is V^π(x) = E[ Σ_{t=0}^T r(x_t, π(x_t)) | x_0 = x; π ], where T is the first random time when the agent achieves a terminal state. A. LAZARIC Markov Decision Processes and Dynamic Programming 52/81

86 Bellman Equations for Undiscounted Infinite Horizon Problems Proper Policies Definition A stationary policy π is proper if ∃n ∈ N such that ∀x ∈ X the probability of achieving the terminal state x̄ after n steps is strictly positive. That is, ρ_π = max_x P(x_n ≠ x̄ | x_0 = x, π) < 1. A. LAZARIC Markov Decision Processes and Dynamic Programming 53/81

87 Bellman Equations for Undiscounted Infinite Horizon Problems Bounded Value Function Proposition For any proper policy π with parameter ρ_π after n steps, the value function is bounded as ‖V^π‖_∞ ≤ r_max Σ_{t≥0} ρ_π^{⌊t/n⌋}. A. LAZARIC Markov Decision Processes and Dynamic Programming 54/81

88 Bellman Equations for Undiscounted Infinite Horizon Problems The Undiscounted Infinite Horizon Setting Proof. By definition of proper policy, P(x_{2n} ≠ x̄ | x_0 = x, π) = P(x_{2n} ≠ x̄ | x_n ≠ x̄, π) · P(x_n ≠ x̄ | x_0 = x, π) ≤ ρ_π². Then for any t ∈ N, P(x_t ≠ x̄ | x_0 = x, π) ≤ ρ_π^{⌊t/n⌋}, which implies that eventually the terminal state x̄ is achieved with probability 1. Then ‖V^π‖_∞ = max_{x∈X} E[ Σ_{t=0}^∞ r(x_t, π(x_t)) | x_0 = x; π ] ≤ r_max Σ_{t≥0} P(x_t ≠ x̄ | x_0 = x, π) ≤ n r_max + r_max Σ_{t≥n} ρ_π^{⌊t/n⌋}. A. LAZARIC Markov Decision Processes and Dynamic Programming 55/81

89 Bellman Equations for Undiscounted Infinite Horizon Problems Bellman Operator Assumption. There exists at least one proper policy, and for any non-proper policy π there exists at least one state x where V^π(x) = −∞ (cycles with only negative rewards). Proposition ([2]) Under the previous assumption, the optimal value function is bounded, i.e., ‖V*‖_∞ < ∞, and it is the unique fixed point of the optimal Bellman operator T such that for any vector W ∈ R^N, T W(x) = max_{a∈A} [ r(x, a) + Σ_y p(y|x, a) W(y) ]. Furthermore, V* = lim_{k→∞} (T)^k W. A. LAZARIC Markov Decision Processes and Dynamic Programming 56/81

90 Bellman Equations for Undiscounted Infinite Horizon Problems Bellman Operator Proposition Let all the policies π be proper; then there exist a vector µ ∈ R^N with µ > 0 and a scalar β < 1 such that, ∀x ∈ X, ∀a ∈ A, Σ_y p(y|x, a) µ(y) ≤ β µ(x). Thus both operators T and T^π are contractions in the weighted norm L_{∞,µ}, that is ‖T W_1 − T W_2‖_{∞,µ} ≤ β ‖W_1 − W_2‖_{∞,µ}. A. LAZARIC Markov Decision Processes and Dynamic Programming 57/81

91 Bellman Equations for Undiscounted Infinite Horizon Problems Bellman Operator Proof. Let µ(x) be the maximum (over policies) of the average time to the termination state. This can be easily cast as an MDP where for any action and any state the rewards are 1 (i.e., for any x ∈ X and a ∈ A, r(x, a) = 1). Under the assumption that all the policies are proper, µ is finite and it is the solution to the dynamic programming equation µ(x) = 1 + max_a Σ_y p(y|x, a) µ(y). Then µ(x) ≥ 1 and, for any a ∈ A, µ(x) ≥ 1 + Σ_y p(y|x, a) µ(y). Furthermore, Σ_y p(y|x, a) µ(y) ≤ µ(x) − 1 ≤ β µ(x), for β = max_x (µ(x) − 1) / µ(x) < 1. A. LAZARIC Markov Decision Processes and Dynamic Programming 58/81

92 Bellman Equations for Undiscounted Infinite Horizon Problems Bellman Operator Proof (cont'd). From this definition of µ and β we obtain the contraction property of T (similarly for T^π) in the norm L_{∞,µ}: ‖T W_1 − T W_2‖_{∞,µ} = max_x |T W_1(x) − T W_2(x)| / µ(x) ≤ max_{x,a} Σ_y p(y|x, a) |W_1(y) − W_2(y)| / µ(x) ≤ max_{x,a} Σ_y p(y|x, a) µ(y) ‖W_1 − W_2‖_{∞,µ} / µ(x) ≤ β ‖W_1 − W_2‖_{∞,µ}. A. LAZARIC Markov Decision Processes and Dynamic Programming 59/81

93 Dynamic Programming Outline Mathematical Tools The Markov Decision Process Bellman Equations for Discounted Infinite Horizon Problems Bellman Equations for Undiscounted Infinite Horizon Problems Dynamic Programming Conclusions A. LAZARIC Markov Decision Processes and Dynamic Programming 60/81

94 Dynamic Programming Question How do we compute the value functions / solve an MDP? Value/Policy Iteration algorithms! A. LAZARIC Markov Decision Processes and Dynamic Programming 61/81

95 Dynamic Programming System of Equations The Bellman equation V^π(x) = r(x, π(x)) + γ Σ_y p(y|x, π(x)) V^π(y) is a linear system of equations with N unknowns and N linear constraints. A. LAZARIC Markov Decision Processes and Dynamic Programming 62/81

96 Dynamic Programming System of Equations The Bellman equation V^π(x) = r(x, π(x)) + γ Σ_y p(y|x, π(x)) V^π(y) is a linear system of equations with N unknowns and N linear constraints. The optimal Bellman equation V*(x) = max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ] is a (highly) non-linear system of equations with N unknowns and N non-linear constraints (i.e., the max operator). A. LAZARIC Markov Decision Processes and Dynamic Programming 62/81
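
Because the fixed-policy equation is linear, V^π can be computed with a single linear solve, while V* cannot. A small sketch with a random placeholder MDP:

```python
import numpy as np

N, A, gamma = 5, 3, 0.95
rng = np.random.default_rng(0)
P = rng.random((N, A, N)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((N, A))
pi = rng.integers(0, A, size=N)

# (I - gamma P^pi) V^pi = r^pi  ->  one linear solve gives V^pi.
P_pi = P[np.arange(N), pi, :]
r_pi = r[np.arange(N), pi]
V_pi = np.linalg.solve(np.eye(N) - gamma * P_pi, r_pi)

# Sanity check: V^pi is a fixed point of the (linear) Bellman operator T^pi.
assert np.allclose(V_pi, r_pi + gamma * P_pi @ V_pi)
# The optimal Bellman equation has a max over actions, so no one-shot linear
# solve exists for V*; value/policy iteration handle that non-linearity.
```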

97 Dynamic Programming Value Iteration: the Idea 1. Let V_0 be any vector in R^N A. LAZARIC Markov Decision Processes and Dynamic Programming 63/81

98 Dynamic Programming Value Iteration: the Idea 1. Let V_0 be any vector in R^N 2. At each iteration k = 1, 2, ..., K A. LAZARIC Markov Decision Processes and Dynamic Programming 63/81

99 Dynamic Programming Value Iteration: the Idea 1. Let V_0 be any vector in R^N 2. At each iteration k = 1, 2, ..., K Compute V_{k+1} = T V_k A. LAZARIC Markov Decision Processes and Dynamic Programming 63/81

100 Dynamic Programming Value Iteration: the Idea 1. Let V_0 be any vector in R^N 2. At each iteration k = 1, 2, ..., K Compute V_{k+1} = T V_k 3. Return the greedy policy π_K(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V_K(y) ]. A. LAZARIC Markov Decision Processes and Dynamic Programming 63/81
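
A compact implementation of the scheme above (a sketch: the random MDP is a placeholder, and the sup-norm stopping test is a common practical choice rather than the fixed iteration count K of the slide):

```python
import numpy as np

def value_iteration(P, r, gamma, eps=1e-8, max_iter=10_000):
    V = np.zeros(P.shape[0])
    for _ in range(max_iter):
        V_new = (r + gamma * P @ V).max(axis=1)   # V_{k+1} = T V_k
        if np.abs(V_new - V).max() < eps:         # stop on small sup-norm residual
            V = V_new
            break
        V = V_new
    pi = (r + gamma * P @ V).argmax(axis=1)       # greedy policy w.r.t. the last V_K
    return V, pi

N, A, gamma = 6, 3, 0.9
rng = np.random.default_rng(0)
P = rng.random((N, A, N)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((N, A))
V_star, pi_K = value_iteration(P, r, gamma)
```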

101 Dynamic Programming Value Iteration: the Guarantees From the fixed point property of T: lim_{k→∞} V_k = V*. A. LAZARIC Markov Decision Processes and Dynamic Programming 64/81

102 Dynamic Programming Value Iteration: the Guarantees From the fixed point property of T: lim_{k→∞} V_k = V*. From the contraction property of T: ‖V_{k+1} − V*‖_∞ = ‖T V_k − T V*‖_∞ ≤ γ ‖V_k − V*‖_∞ ≤ γ^{k+1} ‖V_0 − V*‖_∞ → 0. A. LAZARIC Markov Decision Processes and Dynamic Programming 64/81

103 Dynamic Programming Value Iteration: the Guarantees From the fixed point property of T: lim_{k→∞} V_k = V*. From the contraction property of T: ‖V_{k+1} − V*‖_∞ = ‖T V_k − T V*‖_∞ ≤ γ ‖V_k − V*‖_∞ ≤ γ^{k+1} ‖V_0 − V*‖_∞ → 0. Convergence rate. Let ɛ > 0 and ‖r‖_∞ ≤ r_max; then after at most K = log(r_max/ɛ) / log(1/γ) iterations, ‖V_K − V*‖_∞ ≤ ɛ. A. LAZARIC Markov Decision Processes and Dynamic Programming 64/81
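
For a feeling of the magnitudes involved, plugging some illustrative numbers (r_max = 1, ɛ = 0.01, γ = 0.9, not from the slides) into the bound:

```python
import math

# K = log(r_max / eps) / log(1 / gamma) for illustrative values.
r_max, eps, gamma = 1.0, 0.01, 0.9
K = math.ceil(math.log(r_max / eps) / math.log(1 / gamma))
print(K)   # the bound gives K = 44 iterations for an eps-accurate value function
```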

104 Dynamic Programming Value Iteration: the Complexity One application of the optimal Bellman operator takes O(N² |A|) operations. A. LAZARIC Markov Decision Processes and Dynamic Programming 65/81

105 Dynamic Programming Value Iteration: Extensions and Implementations Q-iteration. 1. Let Q_0 be any Q-function 2. At each iteration k = 1, 2, ..., K Compute Q_{k+1} = T Q_k 3. Return the greedy policy π_K(x) ∈ arg max_{a∈A} Q_K(x, a) A. LAZARIC Markov Decision Processes and Dynamic Programming 66/81

106 Dynamic Programming Value Iteration: Extensions and Implementations Q-iteration. 1. Let Q_0 be any Q-function 2. At each iteration k = 1, 2, ..., K Compute Q_{k+1} = T Q_k 3. Return the greedy policy π_K(x) ∈ arg max_{a∈A} Q_K(x, a) Asynchronous VI. 1. Let V_0 be any vector in R^N 2. At each iteration k = 1, 2, ..., K Choose a state x_k Compute V_{k+1}(x_k) = T V_k(x_k) 3. Return the greedy policy π_K(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V_K(y) ]. A. LAZARIC Markov Decision Processes and Dynamic Programming 66/81
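
Q-iteration is the same fixed-point scheme applied to Q-functions. A minimal sketch (placeholder MDP; the number of iterations is chosen arbitrarily):

```python
import numpy as np

def q_iteration(P, r, gamma, K):
    """(T Q)(x, a) = r(x, a) + gamma * sum_y p(y|x,a) max_b Q(y, b)."""
    Q = np.zeros_like(r)
    for _ in range(K):
        Q = r + gamma * P @ Q.max(axis=1)
    return Q, Q.argmax(axis=1)          # greedy policy pi_K(x) in argmax_a Q_K(x, a)

N, A, gamma = 6, 3, 0.9
rng = np.random.default_rng(0)
P = rng.random((N, A, N)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((N, A))
Q_K, pi_K = q_iteration(P, r, gamma, K=200)
```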

107 Dynamic Programming Policy Iteration: the Idea 1. Let π_0 be any stationary policy A. LAZARIC Markov Decision Processes and Dynamic Programming 67/81

108 Dynamic Programming Policy Iteration: the Idea 1. Let π_0 be any stationary policy 2. At each iteration k = 1, 2, ..., K A. LAZARIC Markov Decision Processes and Dynamic Programming 67/81

109 Dynamic Programming Policy Iteration: the Idea 1. Let π_0 be any stationary policy 2. At each iteration k = 1, 2, ..., K Policy evaluation: given π_k, compute V^{π_k}. A. LAZARIC Markov Decision Processes and Dynamic Programming 67/81

110 Dynamic Programming Policy Iteration: the Idea 1. Let π_0 be any stationary policy 2. At each iteration k = 1, 2, ..., K Policy evaluation: given π_k, compute V^{π_k}. Policy improvement: compute the greedy policy π_{k+1}(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V^{π_k}(y) ]. A. LAZARIC Markov Decision Processes and Dynamic Programming 67/81

111 Dynamic Programming Policy Iteration: the Idea 1. Let π_0 be any stationary policy 2. At each iteration k = 1, 2, ..., K Policy evaluation: given π_k, compute V^{π_k}. Policy improvement: compute the greedy policy π_{k+1}(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V^{π_k}(y) ]. 3. Return the last policy π_K A. LAZARIC Markov Decision Processes and Dynamic Programming 67/81

112 Dynamic Programming Policy Iteration: the Idea 1. Let π_0 be any stationary policy 2. At each iteration k = 1, 2, ..., K Policy evaluation: given π_k, compute V^{π_k}. Policy improvement: compute the greedy policy π_{k+1}(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V^{π_k}(y) ]. 3. Return the last policy π_K Remark: usually K is the smallest k such that V^{π_k} = V^{π_{k+1}}. A. LAZARIC Markov Decision Processes and Dynamic Programming 67/81
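
A direct implementation of the loop above (a sketch with a random placeholder MDP; it stops when the greedy policy no longer changes, a common variant of the stopping rule in the remark when exact evaluation is used):

```python
import numpy as np

def policy_iteration(P, r, gamma):
    N, A = r.shape
    pi = np.zeros(N, dtype=int)
    while True:
        # policy evaluation: V^pi = (I - gamma P^pi)^{-1} r^pi
        P_pi = P[np.arange(N), pi, :]
        r_pi = r[np.arange(N), pi]
        V = np.linalg.solve(np.eye(N) - gamma * P_pi, r_pi)
        # policy improvement: greedy policy with respect to V^pi
        pi_new = (r + gamma * P @ V).argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return V, pi
        pi = pi_new

N, A, gamma = 6, 3, 0.9
rng = np.random.default_rng(0)
P = rng.random((N, A, N)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((N, A))
V_star, pi_star = policy_iteration(P, r, gamma)
```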

113 Dynamic Programming Policy Iteration: the Guarantees Proposition The policy iteration algorithm generates a sequence of policies with non-decreasing performance, V^{π_{k+1}} ≥ V^{π_k}, and it converges to π* in a finite number of iterations. A. LAZARIC Markov Decision Processes and Dynamic Programming 68/81

114 Dynamic Programming Policy Iteration: the Guarantees Proof. From the definition of the Bellman operators and the greedy policy π_{k+1}, V^{π_k} = T^{π_k} V^{π_k} ≤ T V^{π_k} = T^{π_{k+1}} V^{π_k}, (1) A. LAZARIC Markov Decision Processes and Dynamic Programming 69/81

115 Dynamic Programming Policy Iteration: the Guarantees Proof. From the definition of the Bellman operators and the greedy policy π_{k+1}, V^{π_k} = T^{π_k} V^{π_k} ≤ T V^{π_k} = T^{π_{k+1}} V^{π_k}, (1) and from the monotonicity property of T^{π_{k+1}}, it follows that V^{π_k} ≤ T^{π_{k+1}} V^{π_k}, T^{π_{k+1}} V^{π_k} ≤ (T^{π_{k+1}})² V^{π_k}, ..., (T^{π_{k+1}})^{n−1} V^{π_k} ≤ (T^{π_{k+1}})^n V^{π_k}, ... A. LAZARIC Markov Decision Processes and Dynamic Programming 69/81

116 Dynamic Programming Policy Iteration: the Guarantees Proof. From the definition of the Bellman operators and the greedy policy π_{k+1}, V^{π_k} = T^{π_k} V^{π_k} ≤ T V^{π_k} = T^{π_{k+1}} V^{π_k}, (1) and from the monotonicity property of T^{π_{k+1}}, it follows that V^{π_k} ≤ T^{π_{k+1}} V^{π_k}, T^{π_{k+1}} V^{π_k} ≤ (T^{π_{k+1}})² V^{π_k}, ..., (T^{π_{k+1}})^{n−1} V^{π_k} ≤ (T^{π_{k+1}})^n V^{π_k}, ... Joining all the inequalities in the chain we obtain V^{π_k} ≤ lim_{n→∞} (T^{π_{k+1}})^n V^{π_k} = V^{π_{k+1}}. A. LAZARIC Markov Decision Processes and Dynamic Programming 69/81

117 Dynamic Programming Policy Iteration: the Guarantees Proof. From the definition of the Bellman operators and the greedy policy π_{k+1}, V^{π_k} = T^{π_k} V^{π_k} ≤ T V^{π_k} = T^{π_{k+1}} V^{π_k}, (1) and from the monotonicity property of T^{π_{k+1}}, it follows that V^{π_k} ≤ T^{π_{k+1}} V^{π_k}, T^{π_{k+1}} V^{π_k} ≤ (T^{π_{k+1}})² V^{π_k}, ..., (T^{π_{k+1}})^{n−1} V^{π_k} ≤ (T^{π_{k+1}})^n V^{π_k}, ... Joining all the inequalities in the chain we obtain V^{π_k} ≤ lim_{n→∞} (T^{π_{k+1}})^n V^{π_k} = V^{π_{k+1}}. Then (V^{π_k})_k is a non-decreasing sequence. A. LAZARIC Markov Decision Processes and Dynamic Programming 69/81

118 Dynamic Programming Policy Iteration: the Guarantees Proof (cont'd). Since a finite MDP admits a finite number of policies, the termination condition is eventually met for a specific k. Thus eq. (1) holds with equality and we obtain V^{π_k} = T V^{π_k}, hence V^{π_k} = V*, which implies that π_k is an optimal policy. A. LAZARIC Markov Decision Processes and Dynamic Programming 70/81

119 Dynamic Programming Exercise: Convergence Rate Read the more refined convergence rates in: Improved and Generalized Upper Bounds on the Complexity of Policy Iteration by B. Scherrer. A. LAZARIC Markov Decision Processes and Dynamic Programming 71/81

120 Dynamic Programming Policy Iteration Notation. For any policy π, the reward vector is r^π(x) = r(x, π(x)) and the transition matrix is [P^π]_{x,y} = p(y|x, π(x)). A. LAZARIC Markov Decision Processes and Dynamic Programming 72/81

121 Dynamic Programming Policy Iteration: the Policy Evaluation Step Direct computation. For any policy π compute V^π = (I − γP^π)^{−1} r^π. Complexity: O(N³) (improvable via fast matrix inversion). A. LAZARIC Markov Decision Processes and Dynamic Programming 73/81

122 Dynamic Programming Policy Iteration: the Policy Evaluation Step Direct computation. For any policy π compute V^π = (I − γP^π)^{−1} r^π. Complexity: O(N³) (improvable via fast matrix inversion). Exercise: prove the previous equality. A. LAZARIC Markov Decision Processes and Dynamic Programming 73/81

123 Dynamic Programming Policy Iteration: the Policy Evaluation Step Direct computation. For any policy π compute V^π = (I − γP^π)^{−1} r^π. Complexity: O(N³) (improvable via fast matrix inversion). Exercise: prove the previous equality. Iterative policy evaluation. For any policy π, lim_{n→∞} (T^π)^n V_0 = V^π. Complexity: an ɛ-approximation of V^π requires O(N² log(1/ɛ) / log(1/γ)) steps. A. LAZARIC Markov Decision Processes and Dynamic Programming 73/81

124 Dynamic Programming Policy Iteration: the Policy Evaluation Step Direct computation. For any policy π compute V^π = (I − γP^π)^{−1} r^π. Complexity: O(N³) (improvable via fast matrix inversion). Exercise: prove the previous equality. Iterative policy evaluation. For any policy π, lim_{n→∞} (T^π)^n V_0 = V^π. Complexity: an ɛ-approximation of V^π requires O(N² log(1/ɛ) / log(1/γ)) steps. Monte-Carlo simulation. In each state x, simulate n trajectories ((x_t^i)_{t≥0})_{1≤i≤n} following policy π and compute V̂^π(x) ≈ (1/n) Σ_{i=1}^n Σ_{t≥0} γ^t r(x_t^i, π(x_t^i)). Complexity: in each state, the approximation error is O(1/√n). A. LAZARIC Markov Decision Processes and Dynamic Programming 73/81
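
A sketch of the Monte-Carlo alternative (placeholder MDP; the truncation horizon H is an extra practical choice so that the neglected tail γ^H r_max / (1 − γ) is negligible):

```python
import numpy as np

def mc_evaluate(P, r, pi, gamma, x, n=500, H=100, seed=0):
    """Estimate V^pi(x) by averaging the discounted return of n trajectories."""
    rng = np.random.default_rng(seed)
    N = r.shape[0]
    returns = np.zeros(n)
    for i in range(n):
        s, discount = x, 1.0
        for _ in range(H):
            a = pi[s]
            returns[i] += discount * r[s, a]
            s = rng.choice(N, p=P[s, a])
            discount *= gamma
    return returns.mean()               # per-state error is O(1 / sqrt(n))

N, A, gamma = 5, 2, 0.9
rng = np.random.default_rng(1)
P = rng.random((N, A, N)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((N, A))
pi = rng.integers(0, A, size=N)
print(mc_evaluate(P, r, pi, gamma, x=0))
```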

125 Dynamic Programming Policy Iteration: the Policy Improvement Step If the policy is evaluated with V, then the policy improvement has complexity O(N |A|) (computation of an expectation). A. LAZARIC Markov Decision Processes and Dynamic Programming 74/81

126 Dynamic Programming Policy Iteration: the Policy Improvement Step If the policy is evaluated with V, then the policy improvement has complexity O(N |A|) (computation of an expectation). If the policy is evaluated with Q, then the policy improvement has complexity O(|A|), corresponding to π_{k+1}(x) ∈ arg max_{a∈A} Q(x, a). A. LAZARIC Markov Decision Processes and Dynamic Programming 74/81

127 Dynamic Programming Comparison between Value and Policy Iteration Value Iteration Pros: each iteration is very computationally efficient. Cons: convergence is only asymptotic. Policy Iteration Pros: converges in a finite number of iterations (often small in practice). Cons: each iteration requires a full policy evaluation and it might be expensive. A. LAZARIC Markov Decision Processes and Dynamic Programming 75/81

128 Dynamic Programming Exercise: Review Extensions to Standard DP Algorithms Modified Policy Iteration λ-policy Iteration A. LAZARIC Markov Decision Processes and Dynamic Programming 76/81

129 Dynamic Programming Exercise: Review Linear Programming Linear Programming: a one-shot approach to computing V A. LAZARIC Markov Decision Processes and Dynamic Programming 77/81

130 Conclusions Outline Mathematical Tools The Markov Decision Process Bellman Equations for Discounted Infinite Horizon Problems Bellman Equations for Undiscounted Infinite Horizon Problems Dynamic Programming Conclusions A. LAZARIC Markov Decision Processes and Dynamic Programming 78/81

131 Conclusions Things to Remember The Markov Decision Process framework A. LAZARIC Markov Decision Processes and Dynamic Programming 79/81

132 Conclusions Things to Remember The Markov Decision Process framework The discounted infinite horizon setting A. LAZARIC Markov Decision Processes and Dynamic Programming 79/81

133 Conclusions Things to Remember The Markov Decision Process framework The discounted infinite horizon setting State and state-action value function A. LAZARIC Markov Decision Processes and Dynamic Programming 79/81

134 Conclusions Things to Remember The Markov Decision Process framework The discounted infinite horizon setting State and state-action value function Bellman equations and Bellman operators A. LAZARIC Markov Decision Processes and Dynamic Programming 79/81

135 Conclusions Things to Remember The Markov Decision Process framework The discounted infinite horizon setting State and state-action value function Bellman equations and Bellman operators The value and policy iteration algorithms A. LAZARIC Markov Decision Processes and Dynamic Programming 79/81

136 Conclusions Bibliography I [1] R. E. Bellman. Dynamic Programming. Princeton University Press, Princeton, N.J., 1957. [2] D. P. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996. [3] W. Fleming and R. Rishel. Deterministic and Stochastic Optimal Control. Applications of Mathematics, 1, Springer-Verlag, Berlin New York, 1975. [4] R. A. Howard. Dynamic Programming and Markov Processes. MIT Press, Cambridge, MA, 1960. [5] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, 1994. A. LAZARIC Markov Decision Processes and Dynamic Programming 80/81

137 Conclusions Reinforcement Learning Alessandro Lazaric sequel.lille.inria.fr


More information

Near-Optimal Control of Queueing Systems via Approximate One-Step Policy Improvement

Near-Optimal Control of Queueing Systems via Approximate One-Step Policy Improvement Near-Optimal Control of Queueing Systems via Approximate One-Step Policy Improvement Jefferson Huang March 21, 2018 Reinforcement Learning for Processing Networks Seminar Cornell University Performance

More information

Artificial Intelligence & Sequential Decision Problems

Artificial Intelligence & Sequential Decision Problems Artificial Intelligence & Sequential Decision Problems (CIV6540 - Machine Learning for Civil Engineers) Professor: James-A. Goulet Département des génies civil, géologique et des mines Chapter 15 Goulet

More information

Prioritized Sweeping Converges to the Optimal Value Function

Prioritized Sweeping Converges to the Optimal Value Function Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science

More information

Lecture 17: Reinforcement Learning, Finite Markov Decision Processes

Lecture 17: Reinforcement Learning, Finite Markov Decision Processes CSE599i: Online and Adaptive Machine Learning Winter 2018 Lecture 17: Reinforcement Learning, Finite Markov Decision Processes Lecturer: Kevin Jamieson Scribes: Aida Amini, Kousuke Ariga, James Ferguson,

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Based on [Kaelbling et al., 1996, Bertsekas, 2000] Bert Kappen Reinforcement learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error

More information

CS 598 Statistical Reinforcement Learning. Nan Jiang

CS 598 Statistical Reinforcement Learning. Nan Jiang CS 598 Statistical Reinforcement Learning Nan Jiang Overview What s this course about? A grad-level seminar course on theory of RL 3 What s this course about? A grad-level seminar course on theory of RL

More information

Algorithms for MDPs and Their Convergence

Algorithms for MDPs and Their Convergence MS&E338 Reinforcement Learning Lecture 2 - April 4 208 Algorithms for MDPs and Their Convergence Lecturer: Ben Van Roy Scribe: Matthew Creme and Kristen Kessel Bellman operators Recall from last lecture

More information

Optimal Control of an Inventory System with Joint Production and Pricing Decisions

Optimal Control of an Inventory System with Joint Production and Pricing Decisions Optimal Control of an Inventory System with Joint Production and Pricing Decisions Ping Cao, Jingui Xie Abstract In this study, we consider a stochastic inventory system in which the objective of the manufacturer

More information

Lecture 1: March 7, 2018

Lecture 1: March 7, 2018 Reinforcement Learning Spring Semester, 2017/8 Lecture 1: March 7, 2018 Lecturer: Yishay Mansour Scribe: ym DISCLAIMER: Based on Learning and Planning in Dynamical Systems by Shie Mannor c, all rights

More information

Procedia Computer Science 00 (2011) 000 6

Procedia Computer Science 00 (2011) 000 6 Procedia Computer Science (211) 6 Procedia Computer Science Complex Adaptive Systems, Volume 1 Cihan H. Dagli, Editor in Chief Conference Organized by Missouri University of Science and Technology 211-

More information

, and rewards and transition matrices as shown below:

, and rewards and transition matrices as shown below: CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount

More information

1 Markov decision processes

1 Markov decision processes 2.997 Decision-Making in Large-Scale Systems February 4 MI, Spring 2004 Handout #1 Lecture Note 1 1 Markov decision processes In this class we will study discrete-time stochastic systems. We can describe

More information

Reinforcement Learning. Yishay Mansour Tel-Aviv University

Reinforcement Learning. Yishay Mansour Tel-Aviv University Reinforcement Learning Yishay Mansour Tel-Aviv University 1 Reinforcement Learning: Course Information Classes: Wednesday Lecture 10-13 Yishay Mansour Recitations:14-15/15-16 Eliya Nachmani Adam Polyak

More information

Infinite-Horizon Discounted Markov Decision Processes

Infinite-Horizon Discounted Markov Decision Processes Infinite-Horizon Discounted Markov Decision Processes Dan Zhang Leeds School of Business University of Colorado at Boulder Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 1 Outline The expected

More information

CMU Lecture 11: Markov Decision Processes II. Teacher: Gianni A. Di Caro

CMU Lecture 11: Markov Decision Processes II. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 11: Markov Decision Processes II Teacher: Gianni A. Di Caro RECAP: DEFINING MDPS Markov decision processes: o Set of states S o Start state s 0 o Set of actions A o Transitions P(s s,a)

More information

1 Stochastic Dynamic Programming

1 Stochastic Dynamic Programming 1 Stochastic Dynamic Programming Formally, a stochastic dynamic program has the same components as a deterministic one; the only modification is to the state transition equation. When events in the future

More information

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation CS234: RL Emma Brunskill Winter 2018 Material builds on structure from David SIlver s Lecture 4: Model-Free

More information

Distributed Optimization. Song Chong EE, KAIST

Distributed Optimization. Song Chong EE, KAIST Distributed Optimization Song Chong EE, KAIST songchong@kaist.edu Dynamic Programming for Path Planning A path-planning problem consists of a weighted directed graph with a set of n nodes N, directed links

More information

Decision Theory: Markov Decision Processes

Decision Theory: Markov Decision Processes Decision Theory: Markov Decision Processes CPSC 322 Lecture 33 March 31, 2006 Textbook 12.5 Decision Theory: Markov Decision Processes CPSC 322 Lecture 33, Slide 1 Lecture Overview Recap Rewards and Policies

More information

Markov Decision Processes Chapter 17. Mausam

Markov Decision Processes Chapter 17. Mausam Markov Decision Processes Chapter 17 Mausam Planning Agent Static vs. Dynamic Fully vs. Partially Observable Environment What action next? Deterministic vs. Stochastic Perfect vs. Noisy Instantaneous vs.

More information

Central-limit approach to risk-aware Markov decision processes

Central-limit approach to risk-aware Markov decision processes Central-limit approach to risk-aware Markov decision processes Jia Yuan Yu Concordia University November 27, 2015 Joint work with Pengqian Yu and Huan Xu. Inventory Management 1 1 Look at current inventory

More information

Value Function Based Reinforcement Learning in Changing Markovian Environments

Value Function Based Reinforcement Learning in Changing Markovian Environments Journal of Machine Learning Research 9 (2008) 1679-1709 Submitted 6/07; Revised 12/07; Published 8/08 Value Function Based Reinforcement Learning in Changing Markovian Environments Balázs Csanád Csáji

More information

CSE250A Fall 12: Discussion Week 9

CSE250A Fall 12: Discussion Week 9 CSE250A Fall 12: Discussion Week 9 Aditya Menon (akmenon@ucsd.edu) December 4, 2012 1 Schedule for today Recap of Markov Decision Processes. Examples: slot machines and maze traversal. Planning and learning.

More information

Markov Decision Processes and their Applications to Supply Chain Management

Markov Decision Processes and their Applications to Supply Chain Management Markov Decision Processes and their Applications to Supply Chain Management Jefferson Huang School of Operations Research & Information Engineering Cornell University June 24 & 25, 2018 10 th Operations

More information

Introduction to Reinforcement Learning

Introduction to Reinforcement Learning CSCI-699: Advanced Topics in Deep Learning 01/16/2019 Nitin Kamra Spring 2019 Introduction to Reinforcement Learning 1 What is Reinforcement Learning? So far we have seen unsupervised and supervised learning.

More information

Reinforcement Learning with Function Approximation. Joseph Christian G. Noel

Reinforcement Learning with Function Approximation. Joseph Christian G. Noel Reinforcement Learning with Function Approximation Joseph Christian G. Noel November 2011 Abstract Reinforcement learning (RL) is a key problem in the field of Artificial Intelligence. The main goal is

More information

The Multi-Arm Bandit Framework

The Multi-Arm Bandit Framework The Multi-Arm Bandit Framework A. LAZARIC (SequeL Team @INRIA-Lille) ENS Cachan - Master 2 MVA SequeL INRIA Lille MVA-RL Course In This Lecture A. LAZARIC Reinforcement Learning Algorithms Oct 29th, 2013-2/94

More information

Biasing Approximate Dynamic Programming with a Lower Discount Factor

Biasing Approximate Dynamic Programming with a Lower Discount Factor Biasing Approximate Dynamic Programming with a Lower Discount Factor Marek Petrik, Bruno Scherrer To cite this version: Marek Petrik, Bruno Scherrer. Biasing Approximate Dynamic Programming with a Lower

More information

A Gentle Introduction to Reinforcement Learning

A Gentle Introduction to Reinforcement Learning A Gentle Introduction to Reinforcement Learning Alexander Jung 2018 1 Introduction and Motivation Consider the cleaning robot Rumba which has to clean the office room B329. In order to keep things simple,

More information

On Finding Optimal Policies for Markovian Decision Processes Using Simulation

On Finding Optimal Policies for Markovian Decision Processes Using Simulation On Finding Optimal Policies for Markovian Decision Processes Using Simulation Apostolos N. Burnetas Case Western Reserve University Michael N. Katehakis Rutgers University February 1995 Abstract A simulation

More information

Course 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016

Course 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016 Course 16:198:520: Introduction To Artificial Intelligence Lecture 13 Decision Making Abdeslam Boularias Wednesday, December 7, 2016 1 / 45 Overview We consider probabilistic temporal models where the

More information

COMPUTING OPTIMAL SEQUENTIAL ALLOCATION RULES IN CLINICAL TRIALS* Michael N. Katehakis. State University of New York at Stony Brook. and.

COMPUTING OPTIMAL SEQUENTIAL ALLOCATION RULES IN CLINICAL TRIALS* Michael N. Katehakis. State University of New York at Stony Brook. and. COMPUTING OPTIMAL SEQUENTIAL ALLOCATION RULES IN CLINICAL TRIALS* Michael N. Katehakis State University of New York at Stony Brook and Cyrus Derman Columbia University The problem of assigning one of several

More information

The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount

The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount Yinyu Ye Department of Management Science and Engineering and Institute of Computational

More information

6.231 DYNAMIC PROGRAMMING LECTURE 17 LECTURE OUTLINE

6.231 DYNAMIC PROGRAMMING LECTURE 17 LECTURE OUTLINE 6.231 DYNAMIC PROGRAMMING LECTURE 17 LECTURE OUTLINE Undiscounted problems Stochastic shortest path problems (SSP) Proper and improper policies Analysis and computational methods for SSP Pathologies of

More information

ON THE POLICY IMPROVEMENT ALGORITHM IN CONTINUOUS TIME

ON THE POLICY IMPROVEMENT ALGORITHM IN CONTINUOUS TIME ON THE POLICY IMPROVEMENT ALGORITHM IN CONTINUOUS TIME SAUL D. JACKA AND ALEKSANDAR MIJATOVIĆ Abstract. We develop a general approach to the Policy Improvement Algorithm (PIA) for stochastic control problems

More information

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Mostafa D. Awheda Department of Systems and Computer Engineering Carleton University Ottawa, Canada KS 5B6 Email: mawheda@sce.carleton.ca

More information

Efficient Approximate Planning in Continuous Space Markovian Decision Problems

Efficient Approximate Planning in Continuous Space Markovian Decision Problems Efficient Approximate Planning in Continuous Space Markovian Decision Problems Csaba Szepesvári a,, a Mindmaker Ltd. Budapest HUGARY - 2 Konkoly-Thege M. u. 29-33. E-mail: szepes@mindmaker.hu Monte-Carlo

More information

Q-Learning for Markov Decision Processes*

Q-Learning for Markov Decision Processes* McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan khoa.phan@mail.mcgill.ca Sandeep Manjanna sandeep.manjanna@mail.mcgill.ca (*Based on: Convergence of

More information

Open Theoretical Questions in Reinforcement Learning

Open Theoretical Questions in Reinforcement Learning Open Theoretical Questions in Reinforcement Learning Richard S. Sutton AT&T Labs, Florham Park, NJ 07932, USA, sutton@research.att.com, www.cs.umass.edu/~rich Reinforcement learning (RL) concerns the problem

More information

Probabilistic Planning. George Konidaris

Probabilistic Planning. George Konidaris Probabilistic Planning George Konidaris gdk@cs.brown.edu Fall 2017 The Planning Problem Finding a sequence of actions to achieve some goal. Plans It s great when a plan just works but the world doesn t

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time

A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time April 16, 2016 Abstract In this exposition we study the E 3 algorithm proposed by Kearns and Singh for reinforcement

More information

UNCORRECTED PROOFS. P{X(t + s) = j X(t) = i, X(u) = x(u), 0 u < t} = P{X(t + s) = j X(t) = i}.

UNCORRECTED PROOFS. P{X(t + s) = j X(t) = i, X(u) = x(u), 0 u < t} = P{X(t + s) = j X(t) = i}. Cochran eorms934.tex V1 - May 25, 21 2:25 P.M. P. 1 UNIFORMIZATION IN MARKOV DECISION PROCESSES OGUZHAN ALAGOZ MEHMET U.S. AYVACI Department of Industrial and Systems Engineering, University of Wisconsin-Madison,

More information

Model-Based Reinforcement Learning with Continuous States and Actions

Model-Based Reinforcement Learning with Continuous States and Actions Marc P. Deisenroth, Carl E. Rasmussen, and Jan Peters: Model-Based Reinforcement Learning with Continuous States and Actions in Proceedings of the 16th European Symposium on Artificial Neural Networks

More information

ARTIFICIAL INTELLIGENCE. Reinforcement learning

ARTIFICIAL INTELLIGENCE. Reinforcement learning INFOB2KI 2018-2019 Utrecht University The Netherlands ARTIFICIAL INTELLIGENCE Reinforcement learning Lecturer: Silja Renooij These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

More information

Stochastic Primal-Dual Methods for Reinforcement Learning

Stochastic Primal-Dual Methods for Reinforcement Learning Stochastic Primal-Dual Methods for Reinforcement Learning Alireza Askarian 1 Amber Srivastava 1 1 Department of Mechanical Engineering University of Illinois at Urbana Champaign Big Data Optimization,

More information

Some AI Planning Problems

Some AI Planning Problems Course Logistics CS533: Intelligent Agents and Decision Making M, W, F: 1:00 1:50 Instructor: Alan Fern (KEC2071) Office hours: by appointment (see me after class or send email) Emailing me: include CS533

More information

Markov Decision Processes Chapter 17. Mausam

Markov Decision Processes Chapter 17. Mausam Markov Decision Processes Chapter 17 Mausam Planning Agent Static vs. Dynamic Fully vs. Partially Observable Environment What action next? Deterministic vs. Stochastic Perfect vs. Noisy Instantaneous vs.

More information

Motivation for introducing probabilities

Motivation for introducing probabilities for introducing probabilities Reaching the goals is often not sufficient: it is important that the expected costs do not outweigh the benefit of reaching the goals. 1 Objective: maximize benefits - costs.

More information

Value Iteration and Action ɛ-approximation of Optimal Policies in Discounted Markov Decision Processes

Value Iteration and Action ɛ-approximation of Optimal Policies in Discounted Markov Decision Processes Value Iteration and Action ɛ-approximation of Optimal Policies in Discounted Markov Decision Processes RAÚL MONTES-DE-OCA Departamento de Matemáticas Universidad Autónoma Metropolitana-Iztapalapa San Rafael

More information

Reinforcement Learning as Classification Leveraging Modern Classifiers

Reinforcement Learning as Classification Leveraging Modern Classifiers Reinforcement Learning as Classification Leveraging Modern Classifiers Michail G. Lagoudakis and Ronald Parr Department of Computer Science Duke University Durham, NC 27708 Machine Learning Reductions

More information

Basic Deterministic Dynamic Programming

Basic Deterministic Dynamic Programming Basic Deterministic Dynamic Programming Timothy Kam School of Economics & CAMA Australian National University ECON8022, This version March 17, 2008 Motivation What do we do? Outline Deterministic IHDP

More information

Algorithms for Reinforcement Learning

Algorithms for Reinforcement Learning Algorithms for Reinforcement Learning Draft of the lecture published in the Synthesis Lectures on Artificial Intelligence and Machine Learning series by Morgan & Claypool Publishers Csaba Szepesvári June

More information

Gradient Estimation for Attractor Networks

Gradient Estimation for Attractor Networks Gradient Estimation for Attractor Networks Thomas Flynn Department of Computer Science Graduate Center of CUNY July 2017 1 Outline Motivations Deterministic attractor networks Stochastic attractor networks

More information