Markov Decision Processes and Dynamic Programming


Master MVA: Reinforcement Learning, Lecture 2
Markov Decision Processes and Dynamic Programming
Lecturer: Alessandro Lazaric (lazaric/webpage/teaching.html)

Objectives of the lecture
1. Understand: Markov decision processes, Bellman equations and Bellman operators.
2. Use: dynamic programming algorithms.

1 The Markov Decision Process

1.1 Definitions

Definition 1 (Markov chain). Let the state space X be a bounded compact subset of the Euclidean space. The discrete-time dynamic system (x_t)_{t∈N} ∈ X is a Markov chain if

P(x_{t+1} = x | x_t, x_{t−1}, ..., x_0) = P(x_{t+1} = x | x_t),   (1)

so that all the information needed to predict (in probability) the future is contained in the current state (Markov property). Given an initial state x_0 ∈ X, a Markov chain is defined by the transition probability p such that

p(y|x) = P(x_{t+1} = y | x_t = x).   (2)

Remark: notice that in some cases we can turn a higher-order Markov process into a Markov process by including the past as a new state variable. For instance, in the control of an inverted pendulum, the state that can be observed is only the angular position θ_t. In this case the system is non-Markov, since the next position depends not only on the previous position but also on the angular speed, which is defined as dθ_t = θ_t − θ_{t−1} (as a first approximation). Thus, if we define the state space as x_t = (θ_t, dθ_t) we obtain a Markov chain.

Definition 2 (Markov decision process [Bellman, 1957, Howard, 1960, Fleming and Rishel, 1975, Puterman, 1994, Bertsekas and Tsitsiklis, 1996]). A Markov decision process is defined as a tuple M = (X, A, p, r) where

- X is the state space (finite, countable, continuous); in most of our lectures it can be considered finite, with |X| = N,
- A is the action space (finite, countable, continuous),

- p(y|x, a) is the transition probability (i.e., the environment dynamics), where for any x ∈ X, y ∈ X and a ∈ A,

  p(y|x, a) = P(x_{t+1} = y | x_t = x, a_t = a)

  is the probability of observing next state y when action a is taken in x,
- r(x, a, y) is the reinforcement obtained when taking action a in state x and a transition to state y is observed. (Most of the time we will use either r(x, a), which can be considered as the expected value of r(x, a, y), or r(x) as a state-only function.)

Definition 3 (Policy). At time t ∈ N a decision rule π_t is a mapping from states to actions; in particular it can be

- Deterministic: π_t : X → A, where π_t(x) denotes the action chosen at state x at time t,
- Stochastic: π_t : X → Δ(A), where π_t(a|x) denotes the probability of taking action a at state x at time t.

A policy (strategy, plan) is a sequence of decision rules. In particular, we have

- Non-stationary: π = (π_0, π_1, π_2, ...),
- Stationary (Markovian): π = (π, π, π, ...).

Remark: an MDP M together with a deterministic (stochastic) stationary policy π forms a dynamic process (x_t)_{t∈N} (obtained by taking the actions a_t = π(x_t)) which corresponds to a Markov chain with state space X and transition probability p(y|x) = p(y|x, π(x)).

Time horizons

- Finite time horizon T: the problem is characterized by a deadline at time T (e.g., the end of the course) and the agent only focuses on the sum of the rewards up to that time.
- Infinite time horizon with discount: the problem never terminates but rewards which are closer in time receive a higher importance.
- Infinite time horizon with absorbing (terminal) state: the problem never terminates but the agent will eventually reach a terminal state.
- Infinite time horizon with average reward: the problem never terminates but the agent only focuses on the average of the rewards.

Definition 4 (The state value function). Depending on the time horizon we consider, we have

- Finite time horizon T:

  V^π(t, x) = E[ Σ_{s=t}^{T−1} r(x_s, π_s(x_s)) + R(x_T) | x_t = x; π ],   (3)

  where R is a reward function for the final state.

- Infinite time horizon with discount:

  V^π(x) = E[ Σ_{t=0}^{∞} γ^t r(x_t, π(x_t)) | x_0 = x; π ],   (4)

  where 0 ≤ γ < 1 is a discount factor (i.e., small = focus on short-term rewards, big = focus on long-term rewards). Mathematical interpretation: for any γ ∈ [0, 1) the series always converges (for bounded rewards).
- Infinite time horizon with absorbing (terminal) states:

  V^π(x) = E[ Σ_{t=0}^{T} r(x_t, π(x_t)) | x_0 = x; π ],   (5)

  where T is the first (random) time when the agent reaches an absorbing state.
- Infinite time horizon with average reward:

  V^π(x) = lim_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} r(x_t, π(x_t)) | x_0 = x; π ].   (6)

Remark: the previous expectations refer to all the possible stochastic trajectories. More precisely, let us consider an MDP M = (X, A, p, r). If an agent follows a non-stationary policy π from the state x_0, then it observes the random sequence of states (x_0, x_1, x_2, ...) and the corresponding sequence of rewards (r_0, r_1, r_2, ...), where r_t = r(x_t, π_t(x_t)). The corresponding value function (in the infinite horizon with discount setting) is then defined as

V^π(x) = E_{(x_1, x_2, ...)}[ Σ_{t=0}^{∞} γ^t r(x_t, π(x_t)) | x_0 = x; π ],

where each x_t ~ p(·|x_{t−1}, a_{t−1} = π(x_{t−1})) is a random realization from the transition probability of the MDP.

Definition 5 (Optimal policy and optimal value function). The solution to an MDP is an optimal policy π* satisfying

π* ∈ arg max_{π∈Π} V^π

in all the states x ∈ X, where Π is some policy set of interest. The corresponding value function is the optimal value function V* = V^{π*}.

Remark: π* ∈ arg max(·) and not π* = arg max(·) because an MDP may admit more than one optimal policy.

Besides the state value function, we can also introduce an alternative formulation, the state-action value function. For infinite horizon discounted problems we have the following definition (similar for the other settings).

Definition 6 (The state-action value function). In discounted infinite horizon problems, for any policy π, the state-action value function (or Q-function) Q^π : X × A → R is defined as

Q^π(x, a) = E[ Σ_{t≥0} γ^t r(x_t, a_t) | x_0 = x, a_0 = a, a_t = π(x_t), ∀t ≥ 1 ],

and the corresponding optimal Q-function is Q*(x, a) = max_π Q^π(x, a).
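To make these definitions concrete, here is a minimal sketch (not part of the original notes) of a finite MDP stored as numpy arrays, together with a Monte-Carlo estimate of the discounted value function of eq. (4) for a fixed stationary policy; the names P, R, mc_value and the truncation horizon are illustrative assumptions.

```python
import numpy as np

# A finite MDP M = (X, A, p, r) with N states and A actions:
#   P[x, a, y] = p(y | x, a)  (each row P[x, a, :] sums to 1)
#   R[x, a]    = r(x, a)      (expected reward of taking a in x)
N, A, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((N, A, N))
P /= P.sum(axis=2, keepdims=True)      # normalize to obtain transition probabilities
R = rng.random((N, A))

def mc_value(P, R, pi, x0, gamma, n_traj=2000, horizon=200):
    """Monte-Carlo estimate of V^pi(x0) for a deterministic stationary policy pi.

    The infinite discounted sum of eq. (4) is truncated at `horizon`, which
    introduces a bias of at most gamma**horizon * r_max / (1 - gamma)."""
    returns = np.zeros(n_traj)
    for i in range(n_traj):
        x, discount = x0, 1.0
        for _ in range(horizon):
            a = pi[x]
            returns[i] += discount * R[x, a]
            x = rng.choice(P.shape[0], p=P[x, a])  # sample y ~ p(.|x, a)
            discount *= gamma
    return returns.mean()

pi = np.zeros(N, dtype=int)            # the policy "always take action 0"
print(mc_value(P, R, pi, x0=0, gamma=gamma))
```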

The relationships between the V-function and the Q-function are:

Q^π(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) V^π(y),
V^π(x) = Q^π(x, π(x)),
Q*(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) V*(y),
V*(x) = Q*(x, π*(x)) = max_{a∈A} Q*(x, a).

Finally, from an optimal Q-function we can deduce the optimal policy as π*(x) ∈ arg max_{a∈A} Q*(x, a).

1.2 Examples

Example 1 (the MVA student dilemma).

[Figure: transition graph of the student MDP with seven states x_1, ..., x_7, actions Rest/Work, and the transition probabilities (0.4, 0.5, 0.6, ...) and rewards attached to each edge.]

Model: all the transitions are Markov, since the probability of a transition only depends on the previous state. The states x_5, x_6, x_7 are terminal (absorbing) states.

Setting: infinite horizon with terminal states.

Objective: find the policy that maximizes the expected sum of rewards before reaching a terminal state.

Problem 1: if a student knows the transition probabilities and the rewards, how does he/she compute the optimal strategy?

Solution:

[Figure: the student MDP annotated with the value of each state. Starting from the values of the terminal states V_5, V_6, V_7 and solving the Bellman equations backward through x_4, x_3, x_2 and x_1 yields V_1 = V_2 = 88.3.]

Problem 2: what if the student doesn't know the MDP?

Example 2 (The retail store management problem). At each month t, a store contains x_t items of a specific good and the demand for that good is D_t. At the end of each month the manager of the store can order a_t more items from his supplier. Furthermore we know that

- The cost of maintaining an inventory of x items is h(x).
- The cost of ordering a items is C(a).
- The income for selling q items is f(q).
- If the demand D is bigger than the available inventory x, the customers that cannot be served leave and move to another store.
- The income of the remaining inventory at the end of the year is g(x).
- Constraint: the store has a maximum capacity M.

Objective: the performance of the manager is evaluated as the profit obtained over a horizon of one year (T = 12 months).

Problem: how do we formalize the problem?

Solution: we define the following simplified model using the MDP formalism.

- State space: x ∈ X = {0, 1, ..., M}.
- Action space: it is not possible to order more items than the capacity of the store, so the action space should depend on the current state. Formally, at state x, a ∈ A(x) = {0, 1, ..., M − x}.
- Dynamics: x_{t+1} = [x_t + a_t − D_t]^+. Problem: the dynamics should be Markov and stationary. The demand D_t is stochastic and time-independent; formally, D_t ~ D (i.i.d.).
- Reward: r_t = −C(a_t) − h(x_t + a_t) + f([x_t + a_t − x_{t+1}]^+).
- Objective function: E[ Σ_{t=1}^{T−1} r_t + g(x_T) ].

Example 3 (The parking problem). A driver wants to park his car as close as possible to the restaurant.

[Figure: a line of parking places 1, 2, ..., t, ..., T in front of the restaurant; place t is free with probability p(t); the reward of parking grows as the driver gets closer to the restaurant and is 0 if he/she does not park.]

We know that

- The driver cannot see whether a place is available unless he/she is in front of it.
- There are P places.
- At each place i the driver can either move to the next place or park (if the place is available).
- The closer the parking place to the restaurant, the higher the satisfaction.
- If the driver doesn't park anywhere, then he/she leaves the restaurant and has to find another one.

Objective: maximize the satisfaction.

Problem: how do we formalize the problem?

Solution: we define the following simplified model using the MDP formalism.

- State space: x ∈ X = {(1, T), (1, A), (2, T), (2, A), ..., (P, T), (P, A), parked, left}. Each of the P places can be (A)vailable or (T)aken. Whenever the driver parks, he reaches the state parked, while if he never parks the state becomes left. Both these states are terminal states.
- Action space: A(x) = {park, continue} if x = (·, A), and A(x) = {continue} if x = (·, T).
- Dynamics: (see graph). We assume that the probability of a place i being available is ρ(i).
- Reward: for any i ∈ [1, P] we have r((i, A), park, parked) = i, and the reward is 0 otherwise.

- Objective function: E[ Σ_{t=1}^{T} r_t ], with T the random time when a terminal state is reached.

[Figure: transition graph of the parking MDP. At each place the driver finds it available with probability p(i); the action park from an available place leads to the terminal state parked with a reward equal to the place index, the action continue moves to the next place, and continuing past the last place leads to the terminal state left with reward 0.]

Example 4 (The tetris game). Model:

- State space: configuration of the wall and next piece, plus a terminal state when the wall reaches the maximum height.
- Action space: position and orientation of the current piece in the wall.
- Dynamics: new configuration of the wall and new random piece.
- Reward: number of deleted rows.
- Objective function: E[ Σ_{t=1}^{T} r_t ], with T the random time when the terminal state is reached. (Remark: it has been proved that the game eventually terminates with probability 1 for any playing strategy.)

Problem: compute the optimal strategy.

Solution: unknown! The state space is huge: for a problem with maximum height 20, width 10 and 7 different pieces, |X| is on the order of 7 · 2^200.

1.3 Finite Horizon Problems

Proposition 1. For any horizon T and a (non-stationary) policy π = (π_0, ..., π_{T−1}), the state value function defined in eq. (3) satisfies, at any state x ∈ X and time t ∈ {0, ..., T − 1}, the Bellman equation

V^π(t, x) = r(x, π_t(x)) + Σ_{y∈X} p(y|x, π_t(x)) V^π(t+1, y),   (7)

with the terminal condition V^π(T, x) = R(x).

Proof. Recalling equation (3) and Definition 5, we have

V^π(t, x) = E[ Σ_{s=t}^{T−1} r(x_s, π_s(x_s)) + R(x_T) | x_t = x; π ]
 (a) = r(x, π_t(x)) + E_{x_{t+1}, x_{t+2}, ..., x_T}[ Σ_{s=t+1}^{T−1} r(x_s, π_s(x_s)) + R(x_T) | x_t = x; π ]
 (b) = r(x, π_t(x)) + Σ_{y∈X} P[x_{t+1} = y | x_t = x; a_t = π_t(x)] E_{x_{t+2}, ..., x_T}[ Σ_{s=t+1}^{T−1} r(x_s, π_s(x_s)) + R(x_T) | x_{t+1} = y; π ]
 (c) = r(x, π_t(x)) + Σ_{y∈X} p(y|x, π_t(x)) V^π(t+1, y),

where

(a) the expectation is conditioned on x_t = x,
(b) application of the law of total expectation,
(c) definition of the MDP dynamics p and of the value function.

Bellman's Principle of Optimality [Bellman, 1957]: An optimal policy has the property that, whatever the initial state and the initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

Proposition 2. The optimal value function V*(t, x) (i.e., V* = max_π V^π) is the solution to the optimal Bellman equation:

V*(t, x) = max_{a∈A} [ r(x, a) + Σ_{y∈X} p(y|x, a) V*(t+1, y) ],   with 0 ≤ t < T,   (8)
V*(T, x) = R(x),

and the policy

π*_t(x) ∈ arg max_{a∈A} [ r(x, a) + Σ_{y∈X} p(y|x, a) V*(t+1, y) ],   with 0 ≤ t < T,

is an optimal policy.

Proof. By definition we have V*(T, x) = R(x); then V* is defined by backward propagation for any t < T. Any policy π applied from time t < T on at state x can be written as π = (a, π'), where a ∈ A is the action taken at time t in x and π' = (π_{t+1}, ..., π_{T−1}). Then we can show that

V*(t, x) (a) = max_π E[ Σ_{s=t}^{T−1} r(x_s, π_s(x_s)) + R(x_T) | x_t = x; π ]
 (b) = max_{(a, π')} [ r(x, a) + Σ_{y∈X} p(y|x, a) V^{π'}(t+1, y) ]
 (c) = max_{a∈A} [ r(x, a) + Σ_{y∈X} p(y|x, a) max_{π'} V^{π'}(t+1, y) ]
 (d) = max_{a∈A} [ r(x, a) + Σ_{y∈X} p(y|x, a) V*(t+1, y) ],

where

(a): definition of the value function from equation (3),
(b): decomposition of the policy π = (a, π') and recursive definition of the value function,
(c): follows from
 - the trivial inequality max_{π'} Σ_y p(y|x, a) V^{π'}(t+1, y) ≤ Σ_y p(y|x, a) max_{π'} V^{π'}(t+1, y), and
 - the reverse direction: let π̃ = (π̃_{t+1}, ...) be a policy such that π̃_{t+1}(y) = arg max_{b∈A} max_{(π_{t+2}, ...)} V^{(b, π_{t+2}, ...)}(t+1, y). Then

   Σ_y p(y|x, a) max_{π'} V^{π'}(t+1, y) = Σ_y p(y|x, a) V^{π̃}(t+1, y) ≤ max_{π'} Σ_y p(y|x, a) V^{π'}(t+1, y),

(d): definition of the optimal value function.

Finally, the optimal policy simply takes the action for which the maximum is attained at each iteration.

Remark. The previous Bellman equations can easily be extended to the case of state-action value functions.

Parking example. Let V*(p, A) (resp. V*(p, T)) be the optimal value function when at position p and the place is available (resp. taken). The optimal value function can be constructed using the optimal Bellman equations as:

V*(P, A) = max{P, 0} = P,
V*(P, T) = 0,
V*(P−1, A) = max{P−1, ρ(P) V*(P, A) + (1 − ρ(P)) V*(P, T)},
...
V*(p, A) = max{p, ρ(p+1) V*(p+1, A) + (1 − ρ(p+1)) V*(p+1, T)},

where the max corresponds to the two possible choices (park or continue) and the corresponding optimal policy can be derived as the arg max of the previous maxima.
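As an illustration of backward induction on eq. (8), the sketch below (not from the original notes; the number of places and the availability probabilities ρ(i) are made-up values) computes V*(p, A) and V*(p, T) for the parking example and the resulting park/continue decisions.

```python
import numpy as np

P_places = 10                         # number of places; place P is in front of the restaurant
rho = np.full(P_places + 1, 0.3)      # rho[i] = probability that place i is available (illustrative)

# V[p, 0] = V*(p, Available), V[p, 1] = V*(p, Taken)
V = np.zeros((P_places + 1, 2))
V[P_places, 0] = max(P_places, 0)     # last place: park (reward P) or leave (reward 0)
V[P_places, 1] = 0.0

# Backward induction over the remaining places
for p in range(P_places - 1, 0, -1):
    cont = rho[p + 1] * V[p + 1, 0] + (1 - rho[p + 1]) * V[p + 1, 1]
    V[p, 0] = max(p, cont)            # available: park (reward p) or continue
    V[p, 1] = cont                    # taken: the only choice is to continue

park_when_available = [p for p in range(1, P_places + 1) if p >= V[p, 1]]
print("V*(p, A) =", V[1:, 0])
print("park (when available) at places:", park_when_available)
```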

1.4 Discounted Infinite Horizon Problems

Bellman Equations

Proposition 3. For any stationary policy π = (π, π, ...), the state value function at a state x ∈ X satisfies the Bellman equation:

V^π(x) = r(x, π(x)) + γ Σ_y p(y|x, π(x)) V^π(y).   (9)

Proof. For any policy π,

V^π(x) = E[ Σ_{t≥0} γ^t r(x_t, π(x_t)) | x_0 = x; π ]
 = r(x, π(x)) + E[ Σ_{t≥1} γ^t r(x_t, π(x_t)) | x_0 = x; π ]
 = r(x, π(x)) + γ Σ_y P(x_1 = y | x_0 = x; π(x_0)) E[ Σ_{t≥1} γ^{t−1} r(x_t, π(x_t)) | x_1 = y; π ]
 = r(x, π(x)) + γ Σ_y p(y|x, π(x)) V^π(y).

Proposition 4. The optimal value function V* (i.e., V* = max_π V^π) is the solution to the optimal Bellman equation:

V*(x) = max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ].   (10)

Proof. For any policy π = (a, π') (possibly non-stationary),

V*(x) (a) = max_π E[ Σ_{t≥0} γ^t r(x_t, π(x_t)) | x_0 = x; π ]
 (b) = max_{(a, π')} [ r(x, a) + γ Σ_y p(y|x, a) V^{π'}(y) ]
 (c) = max_a [ r(x, a) + γ Σ_y p(y|x, a) max_{π'} V^{π'}(y) ]
 (d) = max_a [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ],   (11)

where

(a): definition of the value function from equation (4),

(b): decomposition of the policy π = (a, π') and recursive definition of the value function,
(c): follows from
 - the trivial inequality max_{π'} Σ_y p(y|x, a) V^{π'}(y) ≤ Σ_y p(y|x, a) max_{π'} V^{π'}(y), and
 - the reverse direction: let π̃(y) = arg max_{π'} V^{π'}(y). Then

   Σ_y p(y|x, a) max_{π'} V^{π'}(y) = Σ_y p(y|x, a) V^{π̃}(y) ≤ max_{π'} Σ_y p(y|x, a) V^{π'}(y),

(d): definition of the optimal value function.

The equivalent Bellman equations for state-action value functions are:

Q^π(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) Q^π(y, π(y)),
Q*(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) max_{b∈A} Q*(y, b).

Bellman Operators

Notation. W.l.o.g. from this moment on we consider a discrete state space with |X| = N, so that for any policy π, V^π is a vector of size N (i.e., V^π ∈ R^N).

Definition 7. For any W ∈ R^N, the Bellman operator T^π : R^N → R^N is defined as

T^π W(x) = r(x, π(x)) + γ Σ_y p(y|x, π(x)) W(y),   (12)

and the optimal Bellman operator (or dynamic programming operator) is defined as

T W(x) = max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) W(y) ].   (13)

Proposition 5. The Bellman operators enjoy a number of properties:

1. Monotonicity: for any W_1, W_2 ∈ R^N, if W_1 ≤ W_2 component-wise, then

   T^π W_1 ≤ T^π W_2,   T W_1 ≤ T W_2.

2. Offset: for any scalar c ∈ R,

   T^π(W + c I_N) = T^π W + γ c I_N,   T(W + c I_N) = T W + γ c I_N,

3. Contraction in L_∞-norm: for any W_1, W_2 ∈ R^N,

   ||T^π W_1 − T^π W_2||_∞ ≤ γ ||W_1 − W_2||_∞,   ||T W_1 − T W_2||_∞ ≤ γ ||W_1 − W_2||_∞.

4. Fixed point: for any policy π, V^π is the unique fixed point of T^π, and V* is the unique fixed point of T. Furthermore, for any W ∈ R^N and any stationary policy π,

   lim_{k→∞} (T^π)^k W = V^π,   lim_{k→∞} (T)^k W = V*.

Proof. Monotonicity (1) and offset (2) directly follow from the definitions. The contraction property (3) holds since for any x ∈ X we have

T W_1(x) − T W_2(x) = max_a [ r(x, a) + γ Σ_y p(y|x, a) W_1(y) ] − max_{a'} [ r(x, a') + γ Σ_y p(y|x, a') W_2(y) ]
 (a) ≤ max_a [ r(x, a) + γ Σ_y p(y|x, a) W_1(y) − r(x, a) − γ Σ_y p(y|x, a) W_2(y) ]
 = γ max_a Σ_y p(y|x, a) ( W_1(y) − W_2(y) )
 ≤ γ ||W_1 − W_2||_∞ max_a Σ_y p(y|x, a) = γ ||W_1 − W_2||_∞,

where in (a) we used max_a f(a) − max_{a'} g(a') ≤ max_a (f(a) − g(a)).

The fixed point property follows from the Bellman equations (eq. 9 and 10) and Banach's fixed point theorem (see Proposition 11). The property regarding the convergence of the limit of the Bellman operators is actually used in the proof of the Banach theorem.

Remark 1 (optimal policy): any stationary policy π*(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ] is optimal. In fact, from the definition of π* we have that T^{π*} V* = T V* = V*, where the first equality follows from the fact that the optimal Bellman operator coincides with the action taken by π*, and the second from the fixed point property of T. Furthermore, by the fixed point property of T^{π*} we have that V^{π*} is the fixed point of T^{π*}, which is also unique. Then V^{π*} = V*, thus implying that π* is an optimal policy.

Remark 2 (value/policy iteration): most of the dynamic programming algorithms studied in Section 2 will heavily rely on property (4).
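Under the same hypothetical array representation used earlier (P[x, a, y] = p(y|x, a), R[x, a] = r(x, a)), the two operators of Definition 7 are one-liners; the sketch below (illustrative, not from the notes) also checks the γ-contraction of property (3) numerically on a random MDP.

```python
import numpy as np

def T_pi(W, P, R, gamma, pi):
    """Bellman operator T^pi of eq. (12) for a deterministic policy pi (array of actions)."""
    idx = np.arange(P.shape[0])
    return R[idx, pi] + gamma * P[idx, pi] @ W

def T_opt(W, P, R, gamma):
    """Optimal Bellman operator T of eq. (13)."""
    return np.max(R + gamma * P @ W, axis=1)

# Numerical check of the contraction property (Proposition 5, item 3)
rng = np.random.default_rng(1)
N, A, gamma = 4, 3, 0.9
P = rng.random((N, A, N)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((N, A))
W1, W2 = rng.random(N), rng.random(N)
lhs = np.max(np.abs(T_opt(W1, P, R, gamma) - T_opt(W2, P, R, gamma)))
assert lhs <= gamma * np.max(np.abs(W1 - W2)) + 1e-12
```

Iterating either operator from any W converges to its fixed point (V^π or V*), which is exactly what the dynamic programming algorithms of Section 2 exploit.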

Bellman operators for Q-functions. Both the Bellman operator T^π and the optimal Bellman operator T can be extended to Q-functions. Thus for any function W : X × A → R (it is simpler to state W as a function rather than as a vector, as done for value functions), we have

T^π W(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) W(y, π(y)),
T W(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) max_{b∈A} W(y, b).

This allows us to extend also all the properties in Proposition 5, notably that Q^π (resp. Q*) is the fixed point of T^π (resp. T).

1.5 Undiscounted Infinite Horizon Problems

Bellman equations

Recall the definition of the value function for infinite horizon problems with absorbing states in eq. (5):

V^π(x) = E[ Σ_{t=0}^{∞} r(x_t, π(x_t)) | x_0 = x; π ] = E[ Σ_{t=0}^{T} r(x_t, π(x_t)) | x_0 = x; π ],

where T is the first (random) time when the agent reaches an absorbing state. The equivalence follows from the fact that we assume that once the absorbing state is reached, the system stays in that state indefinitely long with a constant reward of 0.

Definition 8. A stationary policy π is proper if there exists an integer n ∈ N such that from any initial state x ∈ X the probability of reaching the terminal state (denoted by x̄) after n steps is strictly positive. That is,

ρ_π = max_x P(x_n ≠ x̄ | x_0 = x, π) < 1.

Proposition 6. For any proper policy π with parameter ρ_π after n steps, the value function is bounded as

||V^π||_∞ ≤ r_max Σ_{t≥0} ρ_π^{⌊t/n⌋}.

Proof. By Definition 8 we have

P(x_{2n} ≠ x̄ | x_0 = x, π) = P(x_{2n} ≠ x̄ | x_n ≠ x̄, π) P(x_n ≠ x̄ | x_0 = x, π) ≤ ρ_π^2.

Then for any t ∈ N,

P(x_t ≠ x̄ | x_0 = x, π) ≤ ρ_π^{⌊t/n⌋},

which implies that eventually the terminal state x̄ is reached with probability 1. Then

||V^π||_∞ = max_{x∈X} E[ Σ_{t=0}^{∞} r(x_t, π(x_t)) | x_0 = x; π ] ≤ r_max Σ_{t≥0} P(x_t ≠ x̄ | x_0 = x, π) ≤ n r_max + r_max Σ_{t≥n} ρ_π^{⌊t/n⌋}.

Bellman operators

Assumption. There exists at least one proper policy, and for any non-proper policy π there exists at least one state x where the corresponding value function is negatively unbounded, i.e., V^π(x) = −∞, which corresponds to the existence of a cycle with only negative rewards.

Proposition 7 [Bertsekas and Tsitsiklis, 1996]. Under the previous assumption, the optimal value function is bounded, i.e., ||V*||_∞ < ∞, and it is the unique fixed point of the optimal Bellman operator T such that for any vector W ∈ R^N,

T W(x) = max_{a∈A} [ r(x, a) + Σ_y p(y|x, a) W(y) ].

Furthermore, we have that V* = lim_{k→∞} (T)^k W.

Proposition 8. Let all the policies π be proper. Then there exist a vector µ ∈ R^N with µ > 0 and a scalar β < 1 such that, for any x ∈ X and a ∈ A,

Σ_{y∈X} p(y|x, a) µ(y) ≤ β µ(x).

It then follows that both operators T and T^π are contractions in the weighted norm L_{∞,µ}, that is,

||T W_1 − T W_2||_{∞,µ} ≤ β ||W_1 − W_2||_{∞,µ}.

Proof. Let µ(x) be the maximum (over all policies) of the average time to the terminal goal starting from x. This can easily be cast as an MDP where for any action and any state the reward is 1 (i.e., for any x ∈ X and a ∈ A, r(x, a) = 1). Under the assumption that all the policies are proper, µ is finite and it is the solution to the dynamic programming equation

µ(x) = 1 + max_a Σ_y p(y|x, a) µ(y).

Then µ(x) ≥ 1 and, for any a ∈ A, µ(x) ≥ 1 + Σ_y p(y|x, a) µ(y). Furthermore,

Σ_y p(y|x, a) µ(y) ≤ µ(x) − 1 ≤ β µ(x),   for β = max_x (µ(x) − 1)/µ(x) < 1.

From this definition of µ and β we obtain the contraction property of T (similarly for T^π) in the norm L_{∞,µ}:

||T W_1 − T W_2||_{∞,µ} = max_x |T W_1(x) − T W_2(x)| / µ(x)
 ≤ max_{x,a} Σ_y p(y|x, a) |W_1(y) − W_2(y)| / µ(x)
 ≤ max_{x,a} ( Σ_y p(y|x, a) µ(y) / µ(x) ) ||W_1 − W_2||_{∞,µ}
 ≤ β ||W_1 − W_2||_{∞,µ}.

2 Dynamic Programming

2.1 Value Iteration (VI)

The idea. We build a sequence of value functions. Let V_0 be any vector in R^N; then iterate the application of the optimal Bellman operator, so that given V_k at iteration k we compute

V_{k+1} = T V_k.

From Proposition 5 we have that lim_{k→∞} V_k = V*; thus a repeated application of the Bellman operator makes the initial guess V_0 closer and closer to the actual optimal value function. At any iteration K, the algorithm returns the greedy policy

π_K(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V_K(y) ].

Guarantees. Using the contraction property of the Bellman operator we obtain the convergence of V_k to V*. In fact,

||V_{k+1} − V*||_∞ = ||T V_k − T V*||_∞ ≤ γ ||V_k − V*||_∞ ≤ γ^{k+1} ||V_0 − V*||_∞ → 0.

This also provides the convergence rate of VI. Let ε > 0 be a desired level of accuracy in approximating V* and ||r||_∞ ≤ r_max; then after at most K = log(r_max/ε) / log(1/γ) iterations VI returns a value function V_K such that ||V_K − V*||_∞ ≤ ε. In fact, after K iterations we have

||V_K − V*||_∞ ≤ γ^{K+1} ||V_0 − V*||_∞ ≤ γ^{K+1} r_max ≤ ε.

Computational complexity. One application of the optimal Bellman operator takes O(N^2 |A|) operations.

Remark: notice that the previous guarantee is for the value function returned by VI and not for the corresponding policy π_K.

Q-iteration. Exactly the same algorithm can be applied to Q-functions using the corresponding optimal Bellman operator.

Implementation. There exist several implementations depending on the order used to update the different components of V_k (i.e., the states). For instance, in asynchronous VI, at each iteration k one single state x_k is chosen and only the corresponding component of V_k is updated, i.e., V_{k+1}(x_k) = T V_k(x_k), while all the other states remain unchanged. If all the states are selected infinitely often, then V_k → V*.

Pros: each iteration is very computationally efficient.
Cons: convergence is only asymptotic.
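A minimal synchronous value iteration sketch under the same hypothetical array layout as before (the stopping rule based on the sup-norm of the update is a standard choice, not something prescribed by the notes):

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-8, max_iter=10**6):
    """Iterate V_{k+1} = T V_k and return an eps-accurate value with its greedy policy.

    Stopping when ||V_{k+1} - V_k||_inf < eps * (1 - gamma) / gamma guarantees
    ||V_{k+1} - V*||_inf < eps, by the contraction property of T."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iter):
        Q = R + gamma * P @ V          # Q[x, a] = r(x, a) + gamma * sum_y p(y|x, a) V(y)
        V_new = Q.max(axis=1)          # application of the optimal Bellman operator T
        done = np.max(np.abs(V_new - V)) < eps * (1 - gamma) / gamma
        V = V_new
        if done:
            break
    return V, Q.argmax(axis=1)         # value estimate and greedy policy pi_K
```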

2.2 Policy Iteration (PI)

Notation. Further extending the vector notation, for any policy π we introduce the reward vector r^π ∈ R^N, defined as r^π(x) = r(x, π(x)), and the transition matrix P^π ∈ R^{N×N}, defined as P^π(x, y) = p(y|x, π(x)).

The idea. We build a sequence of policies. Let π_0 be any stationary policy. At each iteration k we perform the two following steps:

1. Policy evaluation: given π_k, compute V^{π_k}.
2. Policy improvement: we compute the greedy policy π_{k+1} from V^{π_k} as

   π_{k+1}(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V^{π_k}(y) ].

The iterations continue until V^{π_k} = V^{π_{k+1}}.

Remark: notice that π_{k+1} is called greedy since it maximizes the current estimate of the future rewards, which corresponds to the application of the optimal Bellman operator, since T^{π_{k+1}} V^{π_k} = T V^{π_k}.

Remark: the PI algorithm is often seen as an actor-critic algorithm.

Guarantees.

Proposition 9. The policy iteration algorithm generates a sequence of policies with non-decreasing performance (i.e., V^{π_{k+1}} ≥ V^{π_k}) and it converges to π* in a finite number of iterations.

Proof. From the definition of the operators T, T^{π_k}, T^{π_{k+1}} and of the greedy policy π_{k+1}, we have that

V^{π_k} = T^{π_k} V^{π_k} ≤ T V^{π_k} = T^{π_{k+1}} V^{π_k},   (14)

and from the monotonicity property of T^{π_{k+1}} it follows that

V^{π_k} ≤ T^{π_{k+1}} V^{π_k},
T^{π_{k+1}} V^{π_k} ≤ (T^{π_{k+1}})^2 V^{π_k},
...
(T^{π_{k+1}})^{n−1} V^{π_k} ≤ (T^{π_{k+1}})^n V^{π_k},
...

Joining all the inequalities in the chain we obtain

V^{π_k} ≤ lim_{n→∞} (T^{π_{k+1}})^n V^{π_k} = V^{π_{k+1}}.

Then (V^{π_k})_k is a non-decreasing sequence. Furthermore, in a finite MDP we have a finite number of policies, so the termination condition is always met for a specific k. At that point eq. (14) holds with an equality, and we obtain V^{π_k} = T V^{π_k}, hence V^{π_k} = V*, which implies that π_k is an optimal policy.

Remark: more recent and refined proofs of the convergence rate of PI are available.

Policy evaluation. At each iteration k the value function V^{π_k} corresponding to the current policy π_k is computed. There are a number of different approaches to compute V^{π_k}.

- Direct computation. From the vector notation introduced before and the Bellman equation in Proposition 3, we have that for any policy π:

  V^π = r^π + γ P^π V^π.

  By rearranging the terms we write the previous equation as (I − γP^π) V^π = r^π. Since P^π is a stochastic matrix, all its eigenvalues have modulus at most 1. Thus the eigenvalues of the matrix (I − γP^π) are lower-bounded in modulus by 1 − γ > 0, which guarantees that it is invertible. Then we can compute V^π as

  V^π = (I − γP^π)^{-1} r^π.

  The inversion can be done with different methods. For instance, the Gauss-Jordan elimination algorithm has a complexity of O(N^3), which can be lowered to roughly O(N^{2.81}) using Strassen's algorithm.

- Iterative policy evaluation. For any policy π, from the properties of the policy Bellman operator T^π, for any V_0 we have that lim_{n→∞} (T^π)^n V_0 = V^π. Thus an approximation of V^π can be computed by re-iterating the application of T^π for n steps. In particular, in order to achieve an ε-approximation of V^π we need O(N^2 log(1/ε) / log(1/γ)) steps.

- Monte-Carlo simulation. In each state x, we simulate n trajectories ((x_t^i)_{t≥0})_{1≤i≤n} following policy π, where for any i = 1, ..., n, x_0^i = x and x_{t+1}^i ~ p(·|x_t^i, π(x_t^i)). We compute

  V̂^π(x) ≈ (1/n) Σ_{i=1}^n Σ_{t≥0} γ^t r(x_t^i, π(x_t^i)).

  The approximation error is of order O(1/√n).

- Temporal-difference (TD): see next lecture.

Policy improvement. The computation of π_{k+1} requires the computation of an expectation w.r.t. the next state, which might be as expensive as O(N |A|). If we move to policy iteration over Q-functions, then the policy improvement step simplifies to

π_{k+1}(x) ∈ arg max_{a∈A} Q(x, a),

which has a computational cost of O(|A|). Furthermore, this could allow us to compute the greedy policy even when the dynamics p is not known.

Pros: convergence in a finite number of iterations (often small in practice).
Cons: each iteration requires a full policy evaluation and it might be expensive.
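Putting the two steps together, here is a sketch of exact policy iteration using the direct (linear solve) policy evaluation described above; the array layout is the same hypothetical one used in the previous snippets.

```python
import numpy as np

def policy_evaluation(P, R, gamma, pi):
    """Direct policy evaluation: solve (I - gamma * P^pi) V = r^pi."""
    idx = np.arange(P.shape[0])
    P_pi = P[idx, pi]                  # P^pi[x, y] = p(y | x, pi(x))
    r_pi = R[idx, pi]                  # r^pi[x]    = r(x, pi(x))
    return np.linalg.solve(np.eye(len(idx)) - gamma * P_pi, r_pi)

def policy_iteration(P, R, gamma):
    N, A = R.shape
    pi = np.zeros(N, dtype=int)        # any initial stationary policy
    while True:
        V = policy_evaluation(P, R, gamma, pi)          # policy evaluation
        pi_new = (R + gamma * P @ V).argmax(axis=1)     # policy improvement (greedy step)
        if np.array_equal(pi_new, pi):                  # greedy policy unchanged: V^pi = T V^pi
            return V, pi
        pi = pi_new
```

Stopping when the greedy policy no longer changes is a standard implementation of the termination condition above: if π_{k+1} = π_k, then T V^{π_k} = T^{π_k} V^{π_k} = V^{π_k}, so V^{π_k} = V*.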

2.3 Linear Programming (LP)

Geometric interpretation of PI. We first provide a different interpretation of PI as a Newton method trying to find the zero of the Bellman residual operator. Let B = T − I be the Bellman residual operator. From the definition of V* as the fixed point of T, it follows that BV* = 0. Thus, computing V* as the fixed point of T is equivalent to computing the zero of B. We first prove the following proposition.

Proposition 10. Let B = T − I be the Bellman residual operator and B' its derivative. Then the sequence of policies π_k generated by PI satisfies

V^{π_{k+1}} = V^{π_k} − (γP^{π_{k+1}} − I)^{-1} [ T^{π_{k+1}} V^{π_k} − V^{π_k} ] = V^{π_k} − [B']^{-1} B V^{π_k},

which coincides with the standard formulation of Newton's method.

Proof. By the definition of V^{π_{k+1}} and the Bellman operators, we have

V^{π_{k+1}} = (I − γP^{π_{k+1}})^{-1} r^{π_{k+1}} − V^{π_k} + V^{π_k}
 = V^{π_k} + (I − γP^{π_{k+1}})^{-1} [ r^{π_{k+1}} + (γP^{π_{k+1}} − I) V^{π_k} ].

[Figure: geometric interpretation of policy iteration. The map V → TV − V is piecewise linear and convex; each policy π_k corresponds to a linear piece T^{π_k}V − V, and the iterates V^{π_1}, V^{π_2}, V^{π_3}, ... converge to the point V = V* where TV − V = 0.]

Following the geometric interpretation, we have that V^{π_k} is the zero of the linear operator T^{π_k} − I. Since the map V → TV − V = max_π (T^π V − V) is convex, Newton's method is guaranteed to converge for any V_0 such that T V_0 − V_0 ≥ 0.

Linear Programming. The previous intuition is at the basis of the observation that V* is the smallest vector V such that V ≥ TV. In fact,

V ≥ TV  ⟹  V ≥ lim_{k→∞} (T)^k V = V*.

Thus, we can compute V* as the solution to the linear program:

min_V Σ_x V(x),

subject to the (finite number of) linear inequalities, which define a polyhedron in R^N:

V(x) ≥ r(x, a) + γ Σ_y p(y|x, a) V(y),   ∀x ∈ X, ∀a ∈ A,

so that the program contains N variables and N|A| constraints.
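A sketch of this linear program using scipy.optimize.linprog (assuming scipy is available; the constraints V(x) ≥ r(x, a) + γ Σ_y p(y|x, a) V(y) are rewritten in the A_ub V ≤ b_ub form expected by the solver, and the array layout is the same hypothetical one as before):

```python
import numpy as np
from scipy.optimize import linprog

def lp_solve_mdp(P, R, gamma):
    """Compute V* as the solution of:  min sum_x V(x)
       s.t.  V(x) >= r(x, a) + gamma * sum_y p(y|x, a) V(y)  for all x, a."""
    N, A = R.shape
    c = np.ones(N)                                  # objective: minimize sum_x V(x)
    # One inequality per (x, a): (gamma * P[x, a, :] - e_x) @ V <= -r(x, a)
    A_ub = gamma * P.reshape(N * A, N) - np.repeat(np.eye(N), A, axis=0)
    b_ub = -R.reshape(N * A)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * N)
    return res.x                                    # V*, up to solver tolerance
```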

A Probability Theory

Definition 9 (Conditional probability). Given two events A and B with P(B) > 0, the conditional probability of A given B is

P(A|B) = P(A ∩ B) / P(B).

Similarly, if X and Y are non-degenerate and jointly continuous random variables with density f_{X,Y}(x, y), then if B has positive measure the conditional probability is

P(X ∈ A | Y ∈ B) = ( ∫_{y∈B} ∫_{x∈A} f_{X,Y}(x, y) dx dy ) / ( ∫_{y∈B} ∫_x f_{X,Y}(x, y) dx dy ).

Definition 10 (Law of total expectation). Given a function f and two random variables X, Y, we have that

E_{X,Y}[ f(X, Y) ] = E_X[ E_Y[ f(x, Y) | X = x ] ].

B Norms, Contractions, and Banach's Fixed-Point Theorem

Definition 11. Given a vector space V ⊆ R^d, a function f : V → R_0^+ is a norm if and only if

- If f(v) = 0 for some v ∈ V, then v = 0.
- For any λ ∈ R, v ∈ V, f(λv) = |λ| f(v).
- Triangle inequality: for any v, u ∈ V, f(v + u) ≤ f(v) + f(u).

A list of useful norms:

- L_p-norm: ||v||_p = ( Σ_{i=1}^d |v_i|^p )^{1/p}.
- L_∞-norm: ||v||_∞ = max_{1≤i≤d} |v_i|.
- L_{µ,p}-norm: ||v||_{µ,p} = ( Σ_{i=1}^d |v_i|^p / µ_i )^{1/p}.
- L_{µ,∞}-norm: ||v||_{µ,∞} = max_{1≤i≤d} |v_i| / µ_i.
- L_{2,P}-matrix norm (P is a positive definite matrix): ||v||_P^2 = v^T P v.

Definition 12. A sequence of vectors v_n ∈ V (with n ∈ N) is said to converge in norm || · || to v ∈ V if lim_{n→∞} ||v_n − v|| = 0.

Definition 13. A sequence of vectors v_n ∈ V (with n ∈ N) is a Cauchy sequence if lim_{n→∞} sup_{m≥n} ||v_n − v_m|| = 0.

Definition 14. A vector space V equipped with a norm || · || is complete if every Cauchy sequence in V is convergent in the norm of the space.

Definition 15. An operator T : V → V is L-Lipschitz if for any v, u ∈ V,

||T v − T u|| ≤ L ||u − v||.

If L ≤ 1 then T is a non-expansion, while if L < 1 then T is an L-contraction. If T is Lipschitz then it is also continuous, that is, if v_n → v then T v_n → T v.

Definition 16. A vector v ∈ V is a fixed point of the operator T : V → V if T v = v.

Proposition 11. Let V be a complete vector space equipped with the norm || · || and T : V → V be a γ-contraction mapping. Then

1. T admits a unique fixed point v*.
2. For any v_0 ∈ V, if v_{n+1} = T v_n then v_n → v* with a geometric convergence rate:

   ||v_n − v*|| ≤ γ^n ||v_0 − v*||.

Proof. We first derive the fundamental contraction property. For any v, u ∈ V:

||v − u|| = ||v − T v + T v − T u + T u − u||
 (a) ≤ ||v − T v|| + ||T v − T u|| + ||T u − u||
 (b) ≤ ||v − T v|| + γ ||v − u|| + ||T u − u||,

where (a) follows from the triangle inequality and (b) from the contraction property of T. Rearranging the terms we obtain:

||v − u|| ≤ ( ||v − T v|| + ||T u − u|| ) / (1 − γ).   (15)

If v and u are both fixed points then ||T v − v|| = 0 and ||T u − u|| = 0; thus from the previous inequality ||v − u|| ≤ 0, which implies v = u. Then T admits at most one fixed point.

Let v_0 ∈ V. For any n, m ∈ N, an iterative application of eq. (15) gives

||T^n v_0 − T^m v_0|| ≤ ( ||T^n v_0 − T T^n v_0|| + ||T T^m v_0 − T^m v_0|| ) / (1 − γ)
 = ( ||T^n v_0 − T^n T v_0|| + ||T^m T v_0 − T^m v_0|| ) / (1 − γ)
 (a) ≤ ( γ^n ||v_0 − T v_0|| + γ^m ||T v_0 − v_0|| ) / (1 − γ)
 = ( (γ^n + γ^m) / (1 − γ) ) ||v_0 − T v_0||,

where in (a) we used the fact that T^n is a γ^n-contraction. Since γ < 1 we have that

lim_{n→∞} sup_{m≥n} ||T^n v_0 − T^m v_0|| ≤ lim_{n→∞} sup_{m≥n} ( (γ^n + γ^m) / (1 − γ) ) ||v_0 − T v_0|| = 0,

which implies that {T^n v_0}_n is a Cauchy sequence. Since V is complete by assumption, {T^n v_0}_n converges to a vector v̄. Recalling that v_{n+1} = T v_n and taking the limit on both sides we obtain

lim_{n→∞} v_{n+1} (a) = lim_{n→∞} T^{n+1} v_0 = v̄,
lim_{n→∞} T v_n (b) = T lim_{n→∞} v_n = T v̄,

where (a) follows from the definition of v_{n+1} and the fact that {T^n v_0}_n is a convergent Cauchy sequence, while (b) follows from the fact that a contraction operator T is also continuous. Joining the two equalities we obtain v̄ = T v̄, which is the definition of a fixed point.

C Linear Algebra

Eigenvalues of a matrix (1). Given a square matrix A ∈ R^{N×N}, a vector v ∈ R^N and a scalar λ are an eigenvector and eigenvalue of the matrix if Av = λv.

Eigenvalues of a matrix (2). Given a square matrix A ∈ R^{N×N} with eigenvalues {λ_i}_{i=1}^N, the matrix (I − αA) has eigenvalues {µ_i = 1 − αλ_i}_{i=1}^N.

Stochastic matrix. A square matrix P ∈ R^{N×N} is a stochastic matrix if

1. all entries are non-negative, i.e., ∀i, j, [P]_{i,j} ≥ 0,
2. all the rows sum to one, i.e., ∀i, Σ_{j=1}^N [P]_{i,j} = 1.

All the eigenvalues of a stochastic matrix are bounded by 1 in modulus, i.e., ∀i, |λ_i| ≤ 1.

Matrix inversion. A square matrix A ∈ R^{N×N} can be inverted if and only if ∀i, λ_i ≠ 0.
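A tiny numerical illustration of Proposition 11 (a sketch, not part of the original notes): iterating an affine γ-contraction built from a stochastic matrix, as in Appendix C, converges geometrically to its unique fixed point.

```python
import numpy as np

rng = np.random.default_rng(2)
d, gamma = 5, 0.8
M = rng.random((d, d)); M /= M.sum(axis=1, keepdims=True)   # stochastic matrix
b = rng.random(d)
T = lambda v: gamma * M @ v + b                              # a gamma-contraction in the L_inf-norm
v_star = np.linalg.solve(np.eye(d) - gamma * M, b)           # its unique fixed point

v = np.zeros(d)
for n in range(1, 11):
    v = T(v)
    # geometric rate: ||v_n - v*||_inf <= gamma^n * ||v_0 - v*||_inf
    assert np.max(np.abs(v - v_star)) <= gamma**n * np.max(np.abs(v_star)) + 1e-12
```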

References

[Bellman, 1957] Bellman, R. E. (1957). Dynamic Programming. Princeton University Press, Princeton, NJ.

[Bertsekas and Tsitsiklis, 1996] Bertsekas, D. and Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena Scientific, Belmont, MA.

[Fleming and Rishel, 1975] Fleming, W. and Rishel, R. (1975). Deterministic and Stochastic Optimal Control. Applications of Mathematics, 1, Springer-Verlag, Berlin, New York.

[Howard, 1960] Howard, R. A. (1960). Dynamic Programming and Markov Processes. MIT Press, Cambridge, MA.

[Puterman, 1994] Puterman, M. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY.


More information

Computational complexity estimates for value and policy iteration algorithms for total-cost and average-cost Markov decision processes

Computational complexity estimates for value and policy iteration algorithms for total-cost and average-cost Markov decision processes Computational complexity estimates for value and policy iteration algorithms for total-cost and average-cost Markov decision processes Jefferson Huang Dept. Applied Mathematics and Statistics Stony Brook

More information

Reinforcement Learning as Classification Leveraging Modern Classifiers

Reinforcement Learning as Classification Leveraging Modern Classifiers Reinforcement Learning as Classification Leveraging Modern Classifiers Michail G. Lagoudakis and Ronald Parr Department of Computer Science Duke University Durham, NC 27708 Machine Learning Reductions

More information

Analysis of Classification-based Policy Iteration Algorithms

Analysis of Classification-based Policy Iteration Algorithms Journal of Machine Learning Research 17 (2016) 1-30 Submitted 12/10; Revised 9/14; Published 4/16 Analysis of Classification-based Policy Iteration Algorithms Alessandro Lazaric 1 Mohammad Ghavamzadeh

More information

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes Today s s Lecture Lecture 20: Learning -4 Review of Neural Networks Markov-Decision Processes Victor Lesser CMPSCI 683 Fall 2004 Reinforcement learning 2 Back-propagation Applicability of Neural Networks

More information

STOCHASTIC PROCESSES Basic notions

STOCHASTIC PROCESSES Basic notions J. Virtamo 38.3143 Queueing Theory / Stochastic processes 1 STOCHASTIC PROCESSES Basic notions Often the systems we consider evolve in time and we are interested in their dynamic behaviour, usually involving

More information

Artificial Intelligence & Sequential Decision Problems

Artificial Intelligence & Sequential Decision Problems Artificial Intelligence & Sequential Decision Problems (CIV6540 - Machine Learning for Civil Engineers) Professor: James-A. Goulet Département des génies civil, géologique et des mines Chapter 15 Goulet

More information

MATH4406 Assignment 5

MATH4406 Assignment 5 MATH4406 Assignment 5 Patrick Laub (ID: 42051392) October 7, 2014 1 The machine replacement model 1.1 Real-world motivation Consider the machine to be the entire world. Over time the creator has running

More information

Simplex Algorithm for Countable-state Discounted Markov Decision Processes

Simplex Algorithm for Countable-state Discounted Markov Decision Processes Simplex Algorithm for Countable-state Discounted Markov Decision Processes Ilbin Lee Marina A. Epelman H. Edwin Romeijn Robert L. Smith November 16, 2014 Abstract We consider discounted Markov Decision

More information

Lecture 1: March 7, 2018

Lecture 1: March 7, 2018 Reinforcement Learning Spring Semester, 2017/8 Lecture 1: March 7, 2018 Lecturer: Yishay Mansour Scribe: ym DISCLAIMER: Based on Learning and Planning in Dynamical Systems by Shie Mannor c, all rights

More information

Reinforcement Learning and Deep Reinforcement Learning

Reinforcement Learning and Deep Reinforcement Learning Reinforcement Learning and Deep Reinforcement Learning Ashis Kumer Biswas, Ph.D. ashis.biswas@ucdenver.edu Deep Learning November 5, 2018 1 / 64 Outlines 1 Principles of Reinforcement Learning 2 The Q

More information

The Art of Sequential Optimization via Simulations

The Art of Sequential Optimization via Simulations The Art of Sequential Optimization via Simulations Stochastic Systems and Learning Laboratory EE, CS* & ISE* Departments Viterbi School of Engineering University of Southern California (Based on joint

More information

Using first-order logic, formalize the following knowledge:

Using first-order logic, formalize the following knowledge: Probabilistic Artificial Intelligence Final Exam Feb 2, 2016 Time limit: 120 minutes Number of pages: 19 Total points: 100 You can use the back of the pages if you run out of space. Collaboration on the

More information

CS599 Lecture 1 Introduction To RL

CS599 Lecture 1 Introduction To RL CS599 Lecture 1 Introduction To RL Reinforcement Learning Introduction Learning from rewards Policies Value Functions Rewards Models of the Environment Exploitation vs. Exploration Dynamic Programming

More information

Lecture 5: The Bellman Equation

Lecture 5: The Bellman Equation Lecture 5: The Bellman Equation Florian Scheuer 1 Plan Prove properties of the Bellman equation (In particular, existence and uniqueness of solution) Use this to prove properties of the solution Think

More information

6 Reinforcement Learning

6 Reinforcement Learning 6 Reinforcement Learning As discussed above, a basic form of supervised learning is function approximation, relating input vectors to output vectors, or, more generally, finding density functions p(y,

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

Value Iteration and Action ɛ-approximation of Optimal Policies in Discounted Markov Decision Processes

Value Iteration and Action ɛ-approximation of Optimal Policies in Discounted Markov Decision Processes Value Iteration and Action ɛ-approximation of Optimal Policies in Discounted Markov Decision Processes RAÚL MONTES-DE-OCA Departamento de Matemáticas Universidad Autónoma Metropolitana-Iztapalapa San Rafael

More information

On Finding Optimal Policies for Markovian Decision Processes Using Simulation

On Finding Optimal Policies for Markovian Decision Processes Using Simulation On Finding Optimal Policies for Markovian Decision Processes Using Simulation Apostolos N. Burnetas Case Western Reserve University Michael N. Katehakis Rutgers University February 1995 Abstract A simulation

More information

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation CS234: RL Emma Brunskill Winter 2018 Material builds on structure from David SIlver s Lecture 4: Model-Free

More information