Markov Decision Processes and Dynamic Programming


Master MVA: Reinforcement Learning, Lecture 2
Markov Decision Processes and Dynamic Programming
Lecturer: Alessandro Lazaric (lazaric/webpage/teaching.html)

Objectives of the lecture
1. Understand: Markov decision processes, Bellman equations and Bellman operators.
2. Use: dynamic programming algorithms.

1 The Markov Decision Process

1.1 Definitions

Definition 1 (Markov chain). Let the state space X be a bounded compact subset of the Euclidean space. The discrete-time dynamic system (x_t)_{t∈N} ∈ X is a Markov chain if

P(x_{t+1} = x | x_t, x_{t−1}, ..., x_0) = P(x_{t+1} = x | x_t),   (1)

so that all the information needed to predict (in probability) the future is contained in the current state (Markov property). Given an initial state x_0 ∈ X, a Markov chain is defined by the transition probability p such that

p(y|x) = P(x_{t+1} = y | x_t = x).   (2)

Remark: notice that in some cases we can turn a higher-order Markov process into a Markov process by including the past as a new state variable. For instance, in the control of an inverted pendulum, the state that can be observed is only the angular position θ_t. In this case the system is non-Markov, since the next position depends not only on the previous position but also on the angular speed, which is defined as dθ_t = θ_t − θ_{t−1} (as a first approximation). Thus, if we define the state space as x_t = (θ_t, dθ_t) we obtain a Markov chain.

Definition 2 (Markov decision process [Bellman, 1957, Howard, 1960, Fleming and Rishel, 1975, Puterman, 1994, Bertsekas and Tsitsiklis, 1996]). A Markov decision process is defined as a tuple M = (X, A, p, r) where

- X is the state space (finite, countable, continuous); in most of our lectures it can be considered finite, with |X| = N,
- A is the action space (finite, countable, continuous),

- p(y|x, a) is the transition probability (i.e., the environment dynamics), where for any x ∈ X, y ∈ X and a ∈ A,

  p(y|x, a) = P(x_{t+1} = y | x_t = x, a_t = a)

  is the probability of observing next state y when action a is taken in x,
- r(x, a, y) is the reinforcement obtained when taking action a in state x and a transition to state y is observed. (Most of the time we will use either r(x, a), which can be considered as the expected value of r(x, a, y), or r(x) as a state-only function.)

Definition 3 (Policy). At time t ∈ N a decision rule π_t is a mapping from states to actions; in particular it can be

- Deterministic: π_t : X → A, where π_t(x) denotes the action chosen at state x at time t,
- Stochastic: π_t : X → Δ(A), where π_t(a|x) denotes the probability of taking action a at state x at time t.

A policy (strategy, plan) is a sequence of decision rules. In particular, we have

- Non-stationary: π = (π_0, π_1, π_2, ...),
- Stationary (Markovian): π = (π, π, π, ...).

Remark: an MDP M together with a deterministic (stochastic) stationary policy π forms a dynamic process (x_t)_{t∈N} (obtained by taking the actions a_t = π(x_t)) which corresponds to a Markov chain with state space X and transition probability p(y|x) = p(y|x, π(x)).

Time horizons

- Finite time horizon T: the problem is characterized by a deadline at time T (e.g., the end of the course) and the agent only focuses on the sum of the rewards up to that time.
- Infinite time horizon with discount: the problem never terminates but rewards which are closer in time receive a higher importance.
- Infinite time horizon with absorbing (terminal) state: the problem never terminates but the agent will eventually reach a terminal state.
- Infinite time horizon with average reward: the problem never terminates but the agent only focuses on the average of the rewards.

Definition 4 (The state value function). Depending on the time horizon we consider, we have

- Finite time horizon T:

  V^π(t, x) = E[ Σ_{s=t}^{T−1} r(x_s, π_s(x_s)) + R(x_T) | x_t = x; π ],   (3)

  where R is a reward function for the final state.

- Infinite time horizon with discount:

  V^π(x) = E[ Σ_{t=0}^{∞} γ^t r(x_t, π(x_t)) | x_0 = x; π ],   (4)

  where 0 ≤ γ < 1 is a discount factor (i.e., small = focus on short-term rewards, big = focus on long-term rewards). Mathematical interpretation: for any γ ∈ [0, 1) the series always converges (for bounded rewards).
- Infinite time horizon with absorbing (terminal) states:

  V^π(x) = E[ Σ_{t=0}^{T} r(x_t, π(x_t)) | x_0 = x; π ],   (5)

  where T is the first (random) time when the agent reaches an absorbing state.
- Infinite time horizon with average reward:

  V^π(x) = lim_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} r(x_t, π(x_t)) | x_0 = x; π ].   (6)

Remark: the previous expectations refer to all the possible stochastic trajectories. More precisely, let us consider an MDP M = (X, A, p, r). If an agent follows a non-stationary policy π from the state x_0, then it observes the random sequence of states (x_0, x_1, x_2, ...) and the corresponding sequence of rewards (r_0, r_1, r_2, ...), where r_t = r(x_t, π_t(x_t)). The corresponding value function (in the infinite horizon with discount setting) is then defined as

V^π(x) = E_{(x_1, x_2, ...)}[ Σ_{t=0}^{∞} γ^t r(x_t, π(x_t)) | x_0 = x; π ],

where each x_t ~ p(·|x_{t−1}, a_{t−1} = π(x_{t−1})) is a random realization from the transition probability of the MDP.

Definition 5 (Optimal policy and optimal value function). The solution to an MDP is an optimal policy π* satisfying

π* ∈ arg max_{π∈Π} V^π

in all the states x ∈ X, where Π is some policy set of interest. The corresponding value function is the optimal value function V* = V^{π*}.

Remark: π* ∈ arg max(·) and not π* = arg max(·) because an MDP may admit more than one optimal policy.

Besides the state value function, we can also introduce an alternative formulation, the state-action value function. For infinite horizon discounted problems we have the following definition (similar for the other settings).

Definition 6 (The state-action value function). In discounted infinite horizon problems, for any policy π, the state-action value function (or Q-function) Q^π : X × A → R is defined as

Q^π(x, a) = E[ Σ_{t≥0} γ^t r(x_t, a_t) | x_0 = x, a_0 = a, a_t = π(x_t), ∀t ≥ 1 ],

and the corresponding optimal Q-function is Q*(x, a) = max_π Q^π(x, a).
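To make these definitions concrete, here is a minimal sketch (not part of the original notes) of a finite MDP stored as numpy arrays, together with a Monte-Carlo estimate of the discounted value function of eq. (4) for a fixed stationary policy; the names P, R, mc_value and the truncation horizon are illustrative assumptions.

```python
import numpy as np

# A finite MDP M = (X, A, p, r) with N states and A actions:
#   P[x, a, y] = p(y | x, a)  (each row P[x, a, :] sums to 1)
#   R[x, a]    = r(x, a)      (expected reward of taking a in x)
N, A, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((N, A, N))
P /= P.sum(axis=2, keepdims=True)      # normalize to obtain transition probabilities
R = rng.random((N, A))

def mc_value(P, R, pi, x0, gamma, n_traj=2000, horizon=200):
    """Monte-Carlo estimate of V^pi(x0) for a deterministic stationary policy pi.

    The infinite discounted sum of eq. (4) is truncated at `horizon`, which
    introduces a bias of at most gamma**horizon * r_max / (1 - gamma)."""
    returns = np.zeros(n_traj)
    for i in range(n_traj):
        x, discount = x0, 1.0
        for _ in range(horizon):
            a = pi[x]
            returns[i] += discount * R[x, a]
            x = rng.choice(P.shape[0], p=P[x, a])  # sample y ~ p(.|x, a)
            discount *= gamma
    return returns.mean()

pi = np.zeros(N, dtype=int)            # the policy "always take action 0"
print(mc_value(P, R, pi, x0=0, gamma=gamma))
```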

The relationships between the V-function and the Q-function are:

Q^π(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) V^π(y),
V^π(x) = Q^π(x, π(x)),
Q*(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) V*(y),
V*(x) = Q*(x, π*(x)) = max_{a∈A} Q*(x, a).

Finally, from an optimal Q-function we can deduce the optimal policy as π*(x) ∈ arg max_{a∈A} Q*(x, a).

1.2 Examples

Example 1 (the MVA student dilemma).

[Figure: transition graph of the student MDP with seven states x_1, ..., x_7, actions Rest/Work, and the transition probabilities (0.4, 0.5, 0.6, ...) and rewards attached to each edge.]

Model: all the transitions are Markov, since the probability of a transition only depends on the previous state. The states x_5, x_6, x_7 are terminal (absorbing) states.

Setting: infinite horizon with terminal states.

Objective: find the policy that maximizes the expected sum of rewards before reaching a terminal state.

Problem 1: if a student knows the transition probabilities and the rewards, how does he/she compute the optimal strategy?

Solution:

[Figure: the student MDP annotated with the value of each state. Starting from the values of the terminal states V_5, V_6, V_7 and solving the Bellman equations backward through x_4, x_3, x_2 and x_1 yields V_1 = V_2 = 88.3.]

Problem 2: what if the student doesn't know the MDP?

Example 2 (The retail store management problem). At each month t, a store contains x_t items of a specific good and the demand for that good is D_t. At the end of each month the manager of the store can order a_t more items from his supplier. Furthermore we know that

- The cost of maintaining an inventory of x items is h(x).
- The cost of ordering a items is C(a).
- The income for selling q items is f(q).
- If the demand D is bigger than the available inventory x, the customers that cannot be served leave and move to another store.
- The income of the remaining inventory at the end of the year is g(x).
- Constraint: the store has a maximum capacity M.

Objective: the performance of the manager is evaluated as the profit obtained over a horizon of one year (T = 12 months).

Problem: how do we formalize the problem?

Solution: we define the following simplified model using the MDP formalism.

- State space: x ∈ X = {0, 1, ..., M}.
- Action space: it is not possible to order more items than the capacity of the store, so the action space should depend on the current state. Formally, at state x, a ∈ A(x) = {0, 1, ..., M − x}.
- Dynamics: x_{t+1} = [x_t + a_t − D_t]^+. Problem: the dynamics should be Markov and stationary. The demand D_t is stochastic and time-independent; formally, D_t ~ D (i.i.d.).
- Reward: r_t = −C(a_t) − h(x_t + a_t) + f([x_t + a_t − x_{t+1}]^+).
- Objective function: E[ Σ_{t=1}^{T−1} r_t + g(x_T) ].

Example 3 (The parking problem). A driver wants to park his car as close as possible to the restaurant.

[Figure: a line of parking places 1, 2, ..., t, ..., T in front of the restaurant; place t is free with probability p(t); the reward of parking grows as the driver gets closer to the restaurant and is 0 if he/she does not park.]

We know that

- The driver cannot see whether a place is available unless he/she is in front of it.
- There are P places.
- At each place i the driver can either move to the next place or park (if the place is available).
- The closer the parking place to the restaurant, the higher the satisfaction.
- If the driver doesn't park anywhere, then he/she leaves the restaurant and has to find another one.

Objective: maximize the satisfaction.

Problem: how do we formalize the problem?

Solution: we define the following simplified model using the MDP formalism.

- State space: x ∈ X = {(1, T), (1, A), (2, T), (2, A), ..., (P, T), (P, A), parked, left}. Each of the P places can be (A)vailable or (T)aken. Whenever the driver parks, he reaches the state parked, while if he never parks the state becomes left. Both these states are terminal states.
- Action space: A(x) = {park, continue} if x = (·, A), and A(x) = {continue} if x = (·, T).
- Dynamics: (see graph). We assume that the probability of a place i being available is ρ(i).
- Reward: for any i ∈ [1, P] we have r((i, A), park, parked) = i, and the reward is 0 otherwise.

- Objective function: E[ Σ_{t=1}^{T} r_t ], with T the random time when a terminal state is reached.

[Figure: transition graph of the parking MDP. At each place the driver finds it available with probability p(i); the action park from an available place leads to the terminal state parked with a reward equal to the place index, the action continue moves to the next place, and continuing past the last place leads to the terminal state left with reward 0.]

Example 4 (The tetris game). Model:

- State space: configuration of the wall and next piece, plus a terminal state when the wall reaches the maximum height.
- Action space: position and orientation of the current piece in the wall.
- Dynamics: new configuration of the wall and new random piece.
- Reward: number of deleted rows.
- Objective function: E[ Σ_{t=1}^{T} r_t ], with T the random time when the terminal state is reached. (Remark: it has been proved that the game eventually terminates with probability 1 for any playing strategy.)

Problem: compute the optimal strategy.

Solution: unknown! The state space is huge: for a problem with maximum height 20, width 10 and 7 different pieces, |X| is on the order of 7 · 2^200.

1.3 Finite Horizon Problems

Proposition 1. For any horizon T and a (non-stationary) policy π = (π_0, ..., π_{T−1}), the state value function defined in eq. (3) satisfies, at any state x ∈ X and time t ∈ {0, ..., T − 1}, the Bellman equation

V^π(t, x) = r(x, π_t(x)) + Σ_{y∈X} p(y|x, π_t(x)) V^π(t+1, y),   (7)

with the terminal condition V^π(T, x) = R(x).

Proof. Recalling equation (3) and Definition 5, we have

V^π(t, x) = E[ Σ_{s=t}^{T−1} r(x_s, π_s(x_s)) + R(x_T) | x_t = x; π ]
 (a) = r(x, π_t(x)) + E_{x_{t+1}, x_{t+2}, ..., x_T}[ Σ_{s=t+1}^{T−1} r(x_s, π_s(x_s)) + R(x_T) | x_t = x; π ]
 (b) = r(x, π_t(x)) + Σ_{y∈X} P[x_{t+1} = y | x_t = x; a_t = π_t(x)] E_{x_{t+2}, ..., x_T}[ Σ_{s=t+1}^{T−1} r(x_s, π_s(x_s)) + R(x_T) | x_{t+1} = y; π ]
 (c) = r(x, π_t(x)) + Σ_{y∈X} p(y|x, π_t(x)) V^π(t+1, y),

where

(a) the expectation is conditioned on x_t = x,
(b) application of the law of total expectation,
(c) definition of the MDP dynamics p and of the value function.

Bellman's Principle of Optimality [Bellman, 1957]: An optimal policy has the property that, whatever the initial state and the initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

Proposition 2. The optimal value function V*(t, x) (i.e., V* = max_π V^π) is the solution to the optimal Bellman equation:

V*(t, x) = max_{a∈A} [ r(x, a) + Σ_{y∈X} p(y|x, a) V*(t+1, y) ],   with 0 ≤ t < T,   (8)
V*(T, x) = R(x),

and the policy

π*_t(x) ∈ arg max_{a∈A} [ r(x, a) + Σ_{y∈X} p(y|x, a) V*(t+1, y) ],   with 0 ≤ t < T,

is an optimal policy.

Proof. By definition we have V*(T, x) = R(x); then V* is defined by backward propagation for any t < T. Any policy π applied from time t < T on at state x can be written as π = (a, π'), where a ∈ A is the action taken at time t in x and π' = (π_{t+1}, ..., π_{T−1}). Then we can show that

V*(t, x) (a) = max_π E[ Σ_{s=t}^{T−1} r(x_s, π_s(x_s)) + R(x_T) | x_t = x; π ]
 (b) = max_{(a, π')} [ r(x, a) + Σ_{y∈X} p(y|x, a) V^{π'}(t+1, y) ]
 (c) = max_{a∈A} [ r(x, a) + Σ_{y∈X} p(y|x, a) max_{π'} V^{π'}(t+1, y) ]
 (d) = max_{a∈A} [ r(x, a) + Σ_{y∈X} p(y|x, a) V*(t+1, y) ],

where

(a): definition of the value function from equation (3),
(b): decomposition of the policy π = (a, π') and recursive definition of the value function,
(c): follows from
 - the trivial inequality max_{π'} Σ_y p(y|x, a) V^{π'}(t+1, y) ≤ Σ_y p(y|x, a) max_{π'} V^{π'}(t+1, y), and
 - the reverse direction: let π̃ = (π̃_{t+1}, ...) be a policy such that π̃_{t+1}(y) = arg max_{b∈A} max_{(π_{t+2}, ...)} V^{(b, π_{t+2}, ...)}(t+1, y). Then

   Σ_y p(y|x, a) max_{π'} V^{π'}(t+1, y) = Σ_y p(y|x, a) V^{π̃}(t+1, y) ≤ max_{π'} Σ_y p(y|x, a) V^{π'}(t+1, y),

(d): definition of the optimal value function.

Finally, the optimal policy simply takes the action for which the maximum is attained at each iteration.

Remark. The previous Bellman equations can easily be extended to the case of state-action value functions.

Parking example. Let V*(p, A) (resp. V*(p, T)) be the optimal value function when at position p and the place is available (resp. taken). The optimal value function can be constructed using the optimal Bellman equations as:

V*(P, A) = max{P, 0} = P,
V*(P, T) = 0,
V*(P−1, A) = max{P−1, ρ(P) V*(P, A) + (1 − ρ(P)) V*(P, T)},
...
V*(p, A) = max{p, ρ(p+1) V*(p+1, A) + (1 − ρ(p+1)) V*(p+1, T)},

where the max corresponds to the two possible choices (park or continue) and the corresponding optimal policy can be derived as the arg max of the previous maxima.
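As an illustration of backward induction on eq. (8), the sketch below (not from the original notes; the number of places and the availability probabilities ρ(i) are made-up values) computes V*(p, A) and V*(p, T) for the parking example and the resulting park/continue decisions.

```python
import numpy as np

P_places = 10                         # number of places; place P is in front of the restaurant
rho = np.full(P_places + 1, 0.3)      # rho[i] = probability that place i is available (illustrative)

# V[p, 0] = V*(p, Available), V[p, 1] = V*(p, Taken)
V = np.zeros((P_places + 1, 2))
V[P_places, 0] = max(P_places, 0)     # last place: park (reward P) or leave (reward 0)
V[P_places, 1] = 0.0

# Backward induction over the remaining places
for p in range(P_places - 1, 0, -1):
    cont = rho[p + 1] * V[p + 1, 0] + (1 - rho[p + 1]) * V[p + 1, 1]
    V[p, 0] = max(p, cont)            # available: park (reward p) or continue
    V[p, 1] = cont                    # taken: the only choice is to continue

park_when_available = [p for p in range(1, P_places + 1) if p >= V[p, 1]]
print("V*(p, A) =", V[1:, 0])
print("park (when available) at places:", park_when_available)
```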

1.4 Discounted Infinite Horizon Problems

Bellman Equations

Proposition 3. For any stationary policy π = (π, π, ...), the state value function at a state x ∈ X satisfies the Bellman equation:

V^π(x) = r(x, π(x)) + γ Σ_y p(y|x, π(x)) V^π(y).   (9)

Proof. For any policy π,

V^π(x) = E[ Σ_{t≥0} γ^t r(x_t, π(x_t)) | x_0 = x; π ]
 = r(x, π(x)) + E[ Σ_{t≥1} γ^t r(x_t, π(x_t)) | x_0 = x; π ]
 = r(x, π(x)) + γ Σ_y P(x_1 = y | x_0 = x; π(x_0)) E[ Σ_{t≥1} γ^{t−1} r(x_t, π(x_t)) | x_1 = y; π ]
 = r(x, π(x)) + γ Σ_y p(y|x, π(x)) V^π(y).

Proposition 4. The optimal value function V* (i.e., V* = max_π V^π) is the solution to the optimal Bellman equation:

V*(x) = max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ].   (10)

Proof. For any policy π = (a, π') (possibly non-stationary),

V*(x) (a) = max_π E[ Σ_{t≥0} γ^t r(x_t, π(x_t)) | x_0 = x; π ]
 (b) = max_{(a, π')} [ r(x, a) + γ Σ_y p(y|x, a) V^{π'}(y) ]
 (c) = max_a [ r(x, a) + γ Σ_y p(y|x, a) max_{π'} V^{π'}(y) ]
 (d) = max_a [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ],   (11)

where

(a): definition of the value function from equation (4),

(b): decomposition of the policy π = (a, π') and recursive definition of the value function,
(c): follows from
 - the trivial inequality max_{π'} Σ_y p(y|x, a) V^{π'}(y) ≤ Σ_y p(y|x, a) max_{π'} V^{π'}(y), and
 - the reverse direction: let π̃(y) = arg max_{π'} V^{π'}(y). Then

   Σ_y p(y|x, a) max_{π'} V^{π'}(y) = Σ_y p(y|x, a) V^{π̃}(y) ≤ max_{π'} Σ_y p(y|x, a) V^{π'}(y),

(d): definition of the optimal value function.

The equivalent Bellman equations for state-action value functions are:

Q^π(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) Q^π(y, π(y)),
Q*(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) max_{b∈A} Q*(y, b).

Bellman Operators

Notation. W.l.o.g. from this moment on we consider a discrete state space with |X| = N, so that for any policy π, V^π is a vector of size N (i.e., V^π ∈ R^N).

Definition 7. For any W ∈ R^N, the Bellman operator T^π : R^N → R^N is defined as

T^π W(x) = r(x, π(x)) + γ Σ_y p(y|x, π(x)) W(y),   (12)

and the optimal Bellman operator (or dynamic programming operator) is defined as

T W(x) = max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) W(y) ].   (13)

Proposition 5. The Bellman operators enjoy a number of properties:

1. Monotonicity: for any W_1, W_2 ∈ R^N, if W_1 ≤ W_2 component-wise, then

   T^π W_1 ≤ T^π W_2,   T W_1 ≤ T W_2.

2. Offset: for any scalar c ∈ R,

   T^π(W + c I_N) = T^π W + γ c I_N,   T(W + c I_N) = T W + γ c I_N,

3. Contraction in L_∞-norm: for any W_1, W_2 ∈ R^N,

   ||T^π W_1 − T^π W_2||_∞ ≤ γ ||W_1 − W_2||_∞,   ||T W_1 − T W_2||_∞ ≤ γ ||W_1 − W_2||_∞.

4. Fixed point: for any policy π, V^π is the unique fixed point of T^π, and V* is the unique fixed point of T. Furthermore, for any W ∈ R^N and any stationary policy π,

   lim_{k→∞} (T^π)^k W = V^π,   lim_{k→∞} (T)^k W = V*.

Proof. Monotonicity (1) and offset (2) directly follow from the definitions. The contraction property (3) holds since for any x ∈ X we have

T W_1(x) − T W_2(x) = max_a [ r(x, a) + γ Σ_y p(y|x, a) W_1(y) ] − max_{a'} [ r(x, a') + γ Σ_y p(y|x, a') W_2(y) ]
 (a) ≤ max_a [ r(x, a) + γ Σ_y p(y|x, a) W_1(y) − r(x, a) − γ Σ_y p(y|x, a) W_2(y) ]
 = γ max_a Σ_y p(y|x, a) ( W_1(y) − W_2(y) )
 ≤ γ ||W_1 − W_2||_∞ max_a Σ_y p(y|x, a) = γ ||W_1 − W_2||_∞,

where in (a) we used max_a f(a) − max_{a'} g(a') ≤ max_a (f(a) − g(a)).

The fixed point property follows from the Bellman equations (eq. 9 and 10) and Banach's fixed point theorem (see Proposition 11). The property regarding the convergence of the limit of the Bellman operators is actually used in the proof of the Banach theorem.

Remark 1 (optimal policy): any stationary policy π*(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ] is optimal. In fact, from the definition of π* we have that T^{π*} V* = T V* = V*, where the first equality follows from the fact that the optimal Bellman operator coincides with the action taken by π*, and the second from the fixed point property of T. Furthermore, by the fixed point property of T^{π*} we have that V^{π*} is the fixed point of T^{π*}, which is also unique. Then V^{π*} = V*, thus implying that π* is an optimal policy.

Remark 2 (value/policy iteration): most of the dynamic programming algorithms studied in Section 2 will heavily rely on property (4).
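Under the same hypothetical array representation used earlier (P[x, a, y] = p(y|x, a), R[x, a] = r(x, a)), the two operators of Definition 7 are one-liners; the sketch below (illustrative, not from the notes) also checks the γ-contraction of property (3) numerically on a random MDP.

```python
import numpy as np

def T_pi(W, P, R, gamma, pi):
    """Bellman operator T^pi of eq. (12) for a deterministic policy pi (array of actions)."""
    idx = np.arange(P.shape[0])
    return R[idx, pi] + gamma * P[idx, pi] @ W

def T_opt(W, P, R, gamma):
    """Optimal Bellman operator T of eq. (13)."""
    return np.max(R + gamma * P @ W, axis=1)

# Numerical check of the contraction property (Proposition 5, item 3)
rng = np.random.default_rng(1)
N, A, gamma = 4, 3, 0.9
P = rng.random((N, A, N)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((N, A))
W1, W2 = rng.random(N), rng.random(N)
lhs = np.max(np.abs(T_opt(W1, P, R, gamma) - T_opt(W2, P, R, gamma)))
assert lhs <= gamma * np.max(np.abs(W1 - W2)) + 1e-12
```

Iterating either operator from any W converges to its fixed point (V^π or V*), which is exactly what the dynamic programming algorithms of Section 2 exploit.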

Bellman operators for Q-functions. Both the Bellman operator T^π and the optimal Bellman operator T can be extended to Q-functions. Thus for any function W : X × A → R (it is simpler to state W as a function rather than as a vector, as done for value functions), we have

T^π W(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) W(y, π(y)),
T W(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) max_{b∈A} W(y, b).

This allows us to extend also all the properties in Proposition 5, notably that Q^π (resp. Q*) is the fixed point of T^π (resp. T).

1.5 Undiscounted Infinite Horizon Problems

Bellman equations

Recall the definition of the value function for infinite horizon problems with absorbing states in eq. (5):

V^π(x) = E[ Σ_{t=0}^{∞} r(x_t, π(x_t)) | x_0 = x; π ] = E[ Σ_{t=0}^{T} r(x_t, π(x_t)) | x_0 = x; π ],

where T is the first (random) time when the agent reaches an absorbing state. The equivalence follows from the fact that we assume that once the absorbing state is reached, the system stays in that state indefinitely long with a constant reward of 0.

Definition 8. A stationary policy π is proper if there exists an integer n ∈ N such that from any initial state x ∈ X the probability of reaching the terminal state (denoted by x̄) after n steps is strictly positive. That is,

ρ_π = max_x P(x_n ≠ x̄ | x_0 = x, π) < 1.

Proposition 6. For any proper policy π with parameter ρ_π after n steps, the value function is bounded as

||V^π||_∞ ≤ r_max Σ_{t≥0} ρ_π^{⌊t/n⌋}.

Proof. By Definition 8 we have

P(x_{2n} ≠ x̄ | x_0 = x, π) = P(x_{2n} ≠ x̄ | x_n ≠ x̄, π) P(x_n ≠ x̄ | x_0 = x, π) ≤ ρ_π^2.

Then for any t ∈ N,

P(x_t ≠ x̄ | x_0 = x, π) ≤ ρ_π^{⌊t/n⌋},

which implies that eventually the terminal state x̄ is reached with probability 1. Then

||V^π||_∞ = max_{x∈X} E[ Σ_{t=0}^{∞} r(x_t, π(x_t)) | x_0 = x; π ] ≤ r_max Σ_{t≥0} P(x_t ≠ x̄ | x_0 = x, π) ≤ n r_max + r_max Σ_{t≥n} ρ_π^{⌊t/n⌋}.

Bellman operators

Assumption. There exists at least one proper policy, and for any non-proper policy π there exists at least one state x where the corresponding value function is negatively unbounded, i.e., V^π(x) = −∞, which corresponds to the existence of a cycle with only negative rewards.

Proposition 7 [Bertsekas and Tsitsiklis, 1996]. Under the previous assumption, the optimal value function is bounded, i.e., ||V*||_∞ < ∞, and it is the unique fixed point of the optimal Bellman operator T such that for any vector W ∈ R^N,

T W(x) = max_{a∈A} [ r(x, a) + Σ_y p(y|x, a) W(y) ].

Furthermore, we have that V* = lim_{k→∞} (T)^k W.

Proposition 8. Let all the policies π be proper. Then there exist a vector µ ∈ R^N with µ > 0 and a scalar β < 1 such that, for any x ∈ X and a ∈ A,

Σ_{y∈X} p(y|x, a) µ(y) ≤ β µ(x).

It then follows that both operators T and T^π are contractions in the weighted norm L_{∞,µ}, that is,

||T W_1 − T W_2||_{∞,µ} ≤ β ||W_1 − W_2||_{∞,µ}.

Proof. Let µ(x) be the maximum (over all policies) of the average time to the terminal goal starting from x. This can easily be cast as an MDP where for any action and any state the reward is 1 (i.e., for any x ∈ X and a ∈ A, r(x, a) = 1). Under the assumption that all the policies are proper, µ is finite and it is the solution to the dynamic programming equation

µ(x) = 1 + max_a Σ_y p(y|x, a) µ(y).

Then µ(x) ≥ 1 and, for any a ∈ A, µ(x) ≥ 1 + Σ_y p(y|x, a) µ(y). Furthermore,

Σ_y p(y|x, a) µ(y) ≤ µ(x) − 1 ≤ β µ(x),   for β = max_x (µ(x) − 1)/µ(x) < 1.

From this definition of µ and β we obtain the contraction property of T (similarly for T^π) in the norm L_{∞,µ}:

||T W_1 − T W_2||_{∞,µ} = max_x |T W_1(x) − T W_2(x)| / µ(x)
 ≤ max_{x,a} Σ_y p(y|x, a) |W_1(y) − W_2(y)| / µ(x)
 ≤ max_{x,a} ( Σ_y p(y|x, a) µ(y) / µ(x) ) ||W_1 − W_2||_{∞,µ}
 ≤ β ||W_1 − W_2||_{∞,µ}.

2 Dynamic Programming

2.1 Value Iteration (VI)

The idea. We build a sequence of value functions. Let V_0 be any vector in R^N; then iterate the application of the optimal Bellman operator, so that given V_k at iteration k we compute

V_{k+1} = T V_k.

From Proposition 5 we have that lim_{k→∞} V_k = V*; thus a repeated application of the Bellman operator makes the initial guess V_0 closer and closer to the actual optimal value function. At any iteration K, the algorithm returns the greedy policy

π_K(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V_K(y) ].

Guarantees. Using the contraction property of the Bellman operator we obtain the convergence of V_k to V*. In fact,

||V_{k+1} − V*||_∞ = ||T V_k − T V*||_∞ ≤ γ ||V_k − V*||_∞ ≤ γ^{k+1} ||V_0 − V*||_∞ → 0.

This also provides the convergence rate of VI. Let ε > 0 be a desired level of accuracy in approximating V* and ||r||_∞ ≤ r_max; then after at most K = log(r_max/ε) / log(1/γ) iterations VI returns a value function V_K such that ||V_K − V*||_∞ ≤ ε. In fact, after K iterations we have

||V_K − V*||_∞ ≤ γ^{K+1} ||V_0 − V*||_∞ ≤ γ^{K+1} r_max ≤ ε.

Computational complexity. One application of the optimal Bellman operator takes O(N^2 |A|) operations.

Remark: notice that the previous guarantee is for the value function returned by VI and not for the corresponding policy π_K.

Q-iteration. Exactly the same algorithm can be applied to Q-functions using the corresponding optimal Bellman operator.

Implementation. There exist several implementations depending on the order used to update the different components of V_k (i.e., the states). For instance, in asynchronous VI, at each iteration k one single state x_k is chosen and only the corresponding component of V_k is updated, i.e., V_{k+1}(x_k) = T V_k(x_k), while all the other states remain unchanged. If all the states are selected infinitely often, then V_k → V*.

Pros: each iteration is very computationally efficient.
Cons: convergence is only asymptotic.
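A minimal synchronous value iteration sketch under the same hypothetical array layout as before (the stopping rule based on the sup-norm of the update is a standard choice, not something prescribed by the notes):

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-8, max_iter=10**6):
    """Iterate V_{k+1} = T V_k and return an eps-accurate value with its greedy policy.

    Stopping when ||V_{k+1} - V_k||_inf < eps * (1 - gamma) / gamma guarantees
    ||V_{k+1} - V*||_inf < eps, by the contraction property of T."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iter):
        Q = R + gamma * P @ V          # Q[x, a] = r(x, a) + gamma * sum_y p(y|x, a) V(y)
        V_new = Q.max(axis=1)          # application of the optimal Bellman operator T
        done = np.max(np.abs(V_new - V)) < eps * (1 - gamma) / gamma
        V = V_new
        if done:
            break
    return V, Q.argmax(axis=1)         # value estimate and greedy policy pi_K
```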

2.2 Policy Iteration (PI)

Notation. Further extending the vector notation, for any policy π we introduce the reward vector r^π ∈ R^N, defined as r^π(x) = r(x, π(x)), and the transition matrix P^π ∈ R^{N×N}, defined as P^π(x, y) = p(y|x, π(x)).

The idea. We build a sequence of policies. Let π_0 be any stationary policy. At each iteration k we perform the two following steps:

1. Policy evaluation: given π_k, compute V^{π_k}.
2. Policy improvement: we compute the greedy policy π_{k+1} from V^{π_k} as

   π_{k+1}(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V^{π_k}(y) ].

The iterations continue until V^{π_k} = V^{π_{k+1}}.

Remark: notice that π_{k+1} is called greedy since it maximizes the current estimate of the future rewards, which corresponds to the application of the optimal Bellman operator, since T^{π_{k+1}} V^{π_k} = T V^{π_k}.

Remark: the PI algorithm is often seen as an actor-critic algorithm.

Guarantees.

Proposition 9. The policy iteration algorithm generates a sequence of policies with non-decreasing performance (i.e., V^{π_{k+1}} ≥ V^{π_k}) and it converges to π* in a finite number of iterations.

Proof. From the definition of the operators T, T^{π_k}, T^{π_{k+1}} and of the greedy policy π_{k+1}, we have that

V^{π_k} = T^{π_k} V^{π_k} ≤ T V^{π_k} = T^{π_{k+1}} V^{π_k},   (14)

and from the monotonicity property of T^{π_{k+1}} it follows that

V^{π_k} ≤ T^{π_{k+1}} V^{π_k},
T^{π_{k+1}} V^{π_k} ≤ (T^{π_{k+1}})^2 V^{π_k},
...
(T^{π_{k+1}})^{n−1} V^{π_k} ≤ (T^{π_{k+1}})^n V^{π_k},
...

Joining all the inequalities in the chain we obtain

V^{π_k} ≤ lim_{n→∞} (T^{π_{k+1}})^n V^{π_k} = V^{π_{k+1}}.

Then (V^{π_k})_k is a non-decreasing sequence. Furthermore, in a finite MDP we have a finite number of policies, so the termination condition is always met for a specific k. At that point eq. (14) holds with an equality, and we obtain V^{π_k} = T V^{π_k}, hence V^{π_k} = V*, which implies that π_k is an optimal policy.

Remark: more recent and refined proofs of the convergence rate of PI are available.

Policy evaluation. At each iteration k the value function V^{π_k} corresponding to the current policy π_k is computed. There are a number of different approaches to compute V^{π_k}.

- Direct computation. From the vector notation introduced before and the Bellman equation in Proposition 3, we have that for any policy π:

  V^π = r^π + γ P^π V^π.

  By rearranging the terms we write the previous equation as (I − γP^π) V^π = r^π. Since P^π is a stochastic matrix, all its eigenvalues have modulus at most 1. Thus the eigenvalues of the matrix (I − γP^π) are lower-bounded in modulus by 1 − γ > 0, which guarantees that it is invertible. Then we can compute V^π as

  V^π = (I − γP^π)^{-1} r^π.

  The inversion can be done with different methods. For instance, the Gauss-Jordan elimination algorithm has a complexity of O(N^3), which can be lowered to roughly O(N^{2.81}) using Strassen's algorithm.

- Iterative policy evaluation. For any policy π, from the properties of the policy Bellman operator T^π, for any V_0 we have that lim_{n→∞} (T^π)^n V_0 = V^π. Thus an approximation of V^π can be computed by re-iterating the application of T^π for n steps. In particular, in order to achieve an ε-approximation of V^π we need O(N^2 log(1/ε) / log(1/γ)) steps.

- Monte-Carlo simulation. In each state x, we simulate n trajectories ((x_t^i)_{t≥0})_{1≤i≤n} following policy π, where for any i = 1, ..., n, x_0^i = x and x_{t+1}^i ~ p(·|x_t^i, π(x_t^i)). We compute

  V̂^π(x) ≈ (1/n) Σ_{i=1}^n Σ_{t≥0} γ^t r(x_t^i, π(x_t^i)).

  The approximation error is of order O(1/√n).

- Temporal-difference (TD): see next lecture.

Policy improvement. The computation of π_{k+1} requires the computation of an expectation w.r.t. the next state, which might be as expensive as O(N |A|). If we move to policy iteration over Q-functions, then the policy improvement step simplifies to

π_{k+1}(x) ∈ arg max_{a∈A} Q(x, a),

which has a computational cost of O(|A|). Furthermore, this could allow us to compute the greedy policy even when the dynamics p is not known.

Pros: convergence in a finite number of iterations (often small in practice).
Cons: each iteration requires a full policy evaluation and it might be expensive.
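Putting the two steps together, here is a sketch of exact policy iteration using the direct (linear solve) policy evaluation described above; the array layout is the same hypothetical one used in the previous snippets.

```python
import numpy as np

def policy_evaluation(P, R, gamma, pi):
    """Direct policy evaluation: solve (I - gamma * P^pi) V = r^pi."""
    idx = np.arange(P.shape[0])
    P_pi = P[idx, pi]                  # P^pi[x, y] = p(y | x, pi(x))
    r_pi = R[idx, pi]                  # r^pi[x]    = r(x, pi(x))
    return np.linalg.solve(np.eye(len(idx)) - gamma * P_pi, r_pi)

def policy_iteration(P, R, gamma):
    N, A = R.shape
    pi = np.zeros(N, dtype=int)        # any initial stationary policy
    while True:
        V = policy_evaluation(P, R, gamma, pi)          # policy evaluation
        pi_new = (R + gamma * P @ V).argmax(axis=1)     # policy improvement (greedy step)
        if np.array_equal(pi_new, pi):                  # greedy policy unchanged: V^pi = T V^pi
            return V, pi
        pi = pi_new
```

Stopping when the greedy policy no longer changes is a standard implementation of the termination condition above: if π_{k+1} = π_k, then T V^{π_k} = T^{π_k} V^{π_k} = V^{π_k}, so V^{π_k} = V*.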

2.3 Linear Programming (LP)

Geometric interpretation of PI. We first provide a different interpretation of PI as a Newton method trying to find the zero of the Bellman residual operator. Let B = T − I be the Bellman residual operator. From the definition of V* as the fixed point of T, it follows that BV* = 0. Thus, computing V* as the fixed point of T is equivalent to computing the zero of B. We first prove the following proposition.

Proposition 10. Let B = T − I be the Bellman residual operator and B' its derivative. Then the sequence of policies π_k generated by PI satisfies

V^{π_{k+1}} = V^{π_k} − (γP^{π_{k+1}} − I)^{-1} [ T^{π_{k+1}} V^{π_k} − V^{π_k} ] = V^{π_k} − [B']^{-1} B V^{π_k},

which coincides with the standard formulation of Newton's method.

Proof. By the definition of V^{π_{k+1}} and the Bellman operators, we have

V^{π_{k+1}} = (I − γP^{π_{k+1}})^{-1} r^{π_{k+1}} − V^{π_k} + V^{π_k}
 = V^{π_k} + (I − γP^{π_{k+1}})^{-1} [ r^{π_{k+1}} + (γP^{π_{k+1}} − I) V^{π_k} ].

[Figure: geometric interpretation of policy iteration. The map V → TV − V is piecewise linear and convex; each policy π_k corresponds to a linear piece T^{π_k}V − V, and the iterates V^{π_1}, V^{π_2}, V^{π_3}, ... converge to the point V = V* where TV − V = 0.]

Following the geometric interpretation, we have that V^{π_k} is the zero of the linear operator T^{π_k} − I. Since the map V → TV − V = max_π (T^π V − V) is convex, Newton's method is guaranteed to converge for any V_0 such that T V_0 − V_0 ≥ 0.

Linear Programming. The previous intuition is at the basis of the observation that V* is the smallest vector V such that V ≥ TV. In fact,

V ≥ TV  ⟹  V ≥ lim_{k→∞} (T)^k V = V*.

Thus, we can compute V* as the solution to the linear program:

min_V Σ_x V(x),

subject to the (finite number of) linear inequalities, which define a polyhedron in R^N:

V(x) ≥ r(x, a) + γ Σ_y p(y|x, a) V(y),   ∀x ∈ X, ∀a ∈ A,

so that the program contains N variables and N|A| constraints.
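A sketch of this linear program using scipy.optimize.linprog (assuming scipy is available; the constraints V(x) ≥ r(x, a) + γ Σ_y p(y|x, a) V(y) are rewritten in the A_ub V ≤ b_ub form expected by the solver, and the array layout is the same hypothetical one as before):

```python
import numpy as np
from scipy.optimize import linprog

def lp_solve_mdp(P, R, gamma):
    """Compute V* as the solution of:  min sum_x V(x)
       s.t.  V(x) >= r(x, a) + gamma * sum_y p(y|x, a) V(y)  for all x, a."""
    N, A = R.shape
    c = np.ones(N)                                  # objective: minimize sum_x V(x)
    # One inequality per (x, a): (gamma * P[x, a, :] - e_x) @ V <= -r(x, a)
    A_ub = gamma * P.reshape(N * A, N) - np.repeat(np.eye(N), A, axis=0)
    b_ub = -R.reshape(N * A)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * N)
    return res.x                                    # V*, up to solver tolerance
```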

A Probability Theory

Definition 9 (Conditional probability). Given two events A and B with P(B) > 0, the conditional probability of A given B is

P(A|B) = P(A ∩ B) / P(B).

Similarly, if X and Y are non-degenerate and jointly continuous random variables with density f_{X,Y}(x, y), then if B has positive measure the conditional probability is

P(X ∈ A | Y ∈ B) = ( ∫_{y∈B} ∫_{x∈A} f_{X,Y}(x, y) dx dy ) / ( ∫_{y∈B} ∫_x f_{X,Y}(x, y) dx dy ).

Definition 10 (Law of total expectation). Given a function f and two random variables X, Y, we have that

E_{X,Y}[ f(X, Y) ] = E_X[ E_Y[ f(x, Y) | X = x ] ].

B Norms, Contractions, and Banach's Fixed-Point Theorem

Definition 11. Given a vector space V ⊆ R^d, a function f : V → R_0^+ is a norm if and only if

- If f(v) = 0 for some v ∈ V, then v = 0.
- For any λ ∈ R, v ∈ V, f(λv) = |λ| f(v).
- Triangle inequality: for any v, u ∈ V, f(v + u) ≤ f(v) + f(u).

A list of useful norms:

- L_p-norm: ||v||_p = ( Σ_{i=1}^d |v_i|^p )^{1/p}.
- L_∞-norm: ||v||_∞ = max_{1≤i≤d} |v_i|.
- L_{µ,p}-norm: ||v||_{µ,p} = ( Σ_{i=1}^d |v_i|^p / µ_i )^{1/p}.
- L_{µ,∞}-norm: ||v||_{µ,∞} = max_{1≤i≤d} |v_i| / µ_i.
- L_{2,P}-matrix norm (P is a positive definite matrix): ||v||_P^2 = v^T P v.

Definition 12. A sequence of vectors v_n ∈ V (with n ∈ N) is said to converge in norm || · || to v ∈ V if lim_{n→∞} ||v_n − v|| = 0.

Definition 13. A sequence of vectors v_n ∈ V (with n ∈ N) is a Cauchy sequence if lim_{n→∞} sup_{m≥n} ||v_n − v_m|| = 0.

Definition 14. A vector space V equipped with a norm || · || is complete if every Cauchy sequence in V is convergent in the norm of the space.

Definition 15. An operator T : V → V is L-Lipschitz if for any v, u ∈ V,

||T v − T u|| ≤ L ||u − v||.

If L ≤ 1 then T is a non-expansion, while if L < 1 then T is an L-contraction. If T is Lipschitz then it is also continuous, that is, if v_n → v then T v_n → T v.

Definition 16. A vector v ∈ V is a fixed point of the operator T : V → V if T v = v.

Proposition 11. Let V be a complete vector space equipped with the norm || · || and T : V → V be a γ-contraction mapping. Then

1. T admits a unique fixed point v*.
2. For any v_0 ∈ V, if v_{n+1} = T v_n then v_n → v* with a geometric convergence rate:

   ||v_n − v*|| ≤ γ^n ||v_0 − v*||.

Proof. We first derive the fundamental contraction property. For any v, u ∈ V:

||v − u|| = ||v − T v + T v − T u + T u − u||
 (a) ≤ ||v − T v|| + ||T v − T u|| + ||T u − u||
 (b) ≤ ||v − T v|| + γ ||v − u|| + ||T u − u||,

where (a) follows from the triangle inequality and (b) from the contraction property of T. Rearranging the terms we obtain:

||v − u|| ≤ ( ||v − T v|| + ||T u − u|| ) / (1 − γ).   (15)

If v and u are both fixed points then ||T v − v|| = 0 and ||T u − u|| = 0; thus from the previous inequality ||v − u|| ≤ 0, which implies v = u. Then T admits at most one fixed point.

Let v_0 ∈ V. For any n, m ∈ N, an iterative application of eq. (15) gives

||T^n v_0 − T^m v_0|| ≤ ( ||T^n v_0 − T T^n v_0|| + ||T T^m v_0 − T^m v_0|| ) / (1 − γ)
 = ( ||T^n v_0 − T^n T v_0|| + ||T^m T v_0 − T^m v_0|| ) / (1 − γ)
 (a) ≤ ( γ^n ||v_0 − T v_0|| + γ^m ||T v_0 − v_0|| ) / (1 − γ)
 = ( (γ^n + γ^m) / (1 − γ) ) ||v_0 − T v_0||,

where in (a) we used the fact that T^n is a γ^n-contraction. Since γ < 1 we have that

lim_{n→∞} sup_{m≥n} ||T^n v_0 − T^m v_0|| ≤ lim_{n→∞} sup_{m≥n} ( (γ^n + γ^m) / (1 − γ) ) ||v_0 − T v_0|| = 0,

which implies that {T^n v_0}_n is a Cauchy sequence. Since V is complete by assumption, {T^n v_0}_n converges to a vector v̄. Recalling that v_{n+1} = T v_n and taking the limit on both sides we obtain

lim_{n→∞} v_{n+1} (a) = lim_{n→∞} T^{n+1} v_0 = v̄,
lim_{n→∞} T v_n (b) = T lim_{n→∞} v_n = T v̄,

where (a) follows from the definition of v_{n+1} and the fact that {T^n v_0}_n is a convergent Cauchy sequence, while (b) follows from the fact that a contraction operator T is also continuous. Joining the two equalities we obtain v̄ = T v̄, which is the definition of a fixed point.

C Linear Algebra

Eigenvalues of a matrix (1). Given a square matrix A ∈ R^{N×N}, a vector v ∈ R^N and a scalar λ are an eigenvector and eigenvalue of the matrix if Av = λv.

Eigenvalues of a matrix (2). Given a square matrix A ∈ R^{N×N} with eigenvalues {λ_i}_{i=1}^N, the matrix (I − αA) has eigenvalues {µ_i = 1 − αλ_i}_{i=1}^N.

Stochastic matrix. A square matrix P ∈ R^{N×N} is a stochastic matrix if

1. all entries are non-negative, i.e., ∀i, j, [P]_{i,j} ≥ 0,
2. all the rows sum to one, i.e., ∀i, Σ_{j=1}^N [P]_{i,j} = 1.

All the eigenvalues of a stochastic matrix are bounded by 1 in modulus, i.e., ∀i, |λ_i| ≤ 1.

Matrix inversion. A square matrix A ∈ R^{N×N} can be inverted if and only if ∀i, λ_i ≠ 0.
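A tiny numerical illustration of Proposition 11 (a sketch, not part of the original notes): iterating an affine γ-contraction built from a stochastic matrix, as in Appendix C, converges geometrically to its unique fixed point.

```python
import numpy as np

rng = np.random.default_rng(2)
d, gamma = 5, 0.8
M = rng.random((d, d)); M /= M.sum(axis=1, keepdims=True)   # stochastic matrix
b = rng.random(d)
T = lambda v: gamma * M @ v + b                              # a gamma-contraction in the L_inf-norm
v_star = np.linalg.solve(np.eye(d) - gamma * M, b)           # its unique fixed point

v = np.zeros(d)
for n in range(1, 11):
    v = T(v)
    # geometric rate: ||v_n - v*||_inf <= gamma^n * ||v_0 - v*||_inf
    assert np.max(np.abs(v - v_star)) <= gamma**n * np.max(np.abs(v_star)) + 1e-12
```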

References

[Bellman, 1957] Bellman, R. E. (1957). Dynamic Programming. Princeton University Press, Princeton, NJ.

[Bertsekas and Tsitsiklis, 1996] Bertsekas, D. and Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena Scientific, Belmont, MA.

[Fleming and Rishel, 1975] Fleming, W. and Rishel, R. (1975). Deterministic and Stochastic Optimal Control. Applications of Mathematics, 1, Springer-Verlag, Berlin, New York.

[Howard, 1960] Howard, R. A. (1960). Dynamic Programming and Markov Processes. MIT Press, Cambridge, MA.

[Puterman, 1994] Puterman, M. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY.


More information

Computational complexity estimates for value and policy iteration algorithms for total-cost and average-cost Markov decision processes

Computational complexity estimates for value and policy iteration algorithms for total-cost and average-cost Markov decision processes Computational complexity estimates for value and policy iteration algorithms for total-cost and average-cost Markov decision processes Jefferson Huang Dept. Applied Mathematics and Statistics Stony Brook

More information

Reinforcement Learning as Classification Leveraging Modern Classifiers

Reinforcement Learning as Classification Leveraging Modern Classifiers Reinforcement Learning as Classification Leveraging Modern Classifiers Michail G. Lagoudakis and Ronald Parr Department of Computer Science Duke University Durham, NC 27708 Machine Learning Reductions

More information

Analysis of Classification-based Policy Iteration Algorithms

Analysis of Classification-based Policy Iteration Algorithms Journal of Machine Learning Research 17 (2016) 1-30 Submitted 12/10; Revised 9/14; Published 4/16 Analysis of Classification-based Policy Iteration Algorithms Alessandro Lazaric 1 Mohammad Ghavamzadeh

More information

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes Today s s Lecture Lecture 20: Learning -4 Review of Neural Networks Markov-Decision Processes Victor Lesser CMPSCI 683 Fall 2004 Reinforcement learning 2 Back-propagation Applicability of Neural Networks

More information

STOCHASTIC PROCESSES Basic notions

STOCHASTIC PROCESSES Basic notions J. Virtamo 38.3143 Queueing Theory / Stochastic processes 1 STOCHASTIC PROCESSES Basic notions Often the systems we consider evolve in time and we are interested in their dynamic behaviour, usually involving

More information

Artificial Intelligence & Sequential Decision Problems

Artificial Intelligence & Sequential Decision Problems Artificial Intelligence & Sequential Decision Problems (CIV6540 - Machine Learning for Civil Engineers) Professor: James-A. Goulet Département des génies civil, géologique et des mines Chapter 15 Goulet

More information

MATH4406 Assignment 5

MATH4406 Assignment 5 MATH4406 Assignment 5 Patrick Laub (ID: 42051392) October 7, 2014 1 The machine replacement model 1.1 Real-world motivation Consider the machine to be the entire world. Over time the creator has running

More information

Simplex Algorithm for Countable-state Discounted Markov Decision Processes

Simplex Algorithm for Countable-state Discounted Markov Decision Processes Simplex Algorithm for Countable-state Discounted Markov Decision Processes Ilbin Lee Marina A. Epelman H. Edwin Romeijn Robert L. Smith November 16, 2014 Abstract We consider discounted Markov Decision

More information

Lecture 1: March 7, 2018

Lecture 1: March 7, 2018 Reinforcement Learning Spring Semester, 2017/8 Lecture 1: March 7, 2018 Lecturer: Yishay Mansour Scribe: ym DISCLAIMER: Based on Learning and Planning in Dynamical Systems by Shie Mannor c, all rights

More information

Reinforcement Learning and Deep Reinforcement Learning

Reinforcement Learning and Deep Reinforcement Learning Reinforcement Learning and Deep Reinforcement Learning Ashis Kumer Biswas, Ph.D. ashis.biswas@ucdenver.edu Deep Learning November 5, 2018 1 / 64 Outlines 1 Principles of Reinforcement Learning 2 The Q

More information

The Art of Sequential Optimization via Simulations

The Art of Sequential Optimization via Simulations The Art of Sequential Optimization via Simulations Stochastic Systems and Learning Laboratory EE, CS* & ISE* Departments Viterbi School of Engineering University of Southern California (Based on joint

More information

Using first-order logic, formalize the following knowledge:

Using first-order logic, formalize the following knowledge: Probabilistic Artificial Intelligence Final Exam Feb 2, 2016 Time limit: 120 minutes Number of pages: 19 Total points: 100 You can use the back of the pages if you run out of space. Collaboration on the

More information

CS599 Lecture 1 Introduction To RL

CS599 Lecture 1 Introduction To RL CS599 Lecture 1 Introduction To RL Reinforcement Learning Introduction Learning from rewards Policies Value Functions Rewards Models of the Environment Exploitation vs. Exploration Dynamic Programming

More information

Lecture 5: The Bellman Equation

Lecture 5: The Bellman Equation Lecture 5: The Bellman Equation Florian Scheuer 1 Plan Prove properties of the Bellman equation (In particular, existence and uniqueness of solution) Use this to prove properties of the solution Think

More information

6 Reinforcement Learning

6 Reinforcement Learning 6 Reinforcement Learning As discussed above, a basic form of supervised learning is function approximation, relating input vectors to output vectors, or, more generally, finding density functions p(y,

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

Value Iteration and Action ɛ-approximation of Optimal Policies in Discounted Markov Decision Processes

Value Iteration and Action ɛ-approximation of Optimal Policies in Discounted Markov Decision Processes Value Iteration and Action ɛ-approximation of Optimal Policies in Discounted Markov Decision Processes RAÚL MONTES-DE-OCA Departamento de Matemáticas Universidad Autónoma Metropolitana-Iztapalapa San Rafael

More information

On Finding Optimal Policies for Markovian Decision Processes Using Simulation

On Finding Optimal Policies for Markovian Decision Processes Using Simulation On Finding Optimal Policies for Markovian Decision Processes Using Simulation Apostolos N. Burnetas Case Western Reserve University Michael N. Katehakis Rutgers University February 1995 Abstract A simulation

More information

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation CS234: RL Emma Brunskill Winter 2018 Material builds on structure from David SIlver s Lecture 4: Model-Free

More information