The Value Functions of Markov Decision Processes

Ehud Lehrer, Eilon Solan, and Omri N. Solan

(Lehrer acknowledges the support of the Israel Science Foundation, Grant #963/. Solan acknowledges the support of the Israel Science Foundation, Grant #33/3. Affiliations: School of Mathematical Sciences, Tel Aviv University, Tel Aviv, Israel, and INSEAD, Bd. de Constance, Fontainebleau Cedex, France, lehrer@post.tau.ac.il; School of Mathematical Sciences, Tel Aviv University, Tel Aviv, Israel, eilons@post.tau.ac.il; School of Mathematical Sciences, Tel Aviv University, Tel Aviv, Israel, omrisola@post.tau.ac.il.)

Abstract

We provide a full characterization of the set of value functions of Markov decision processes.

1 Introduction

Markov decision processes are a standard tool for studying dynamic optimization problems. The discounted value of such a problem is the maximal total discounted amount that the decision maker can guarantee to himself. By Blackwell (1965), the function λ ↦ v_λ(s) that assigns to each discount factor λ the discounted value at the initial state s is the maximum of finitely many rational functions (with real coefficients). Standard arguments show that the roots of the polynomial in the denominator of these rational functions either lie outside the unit ball in the complex plane, or lie on the boundary of the unit ball, in which case they have multiplicity 1.

Using the theory of eigenvalues of stochastic matrices one can show that the roots on the boundary of the unit ball must be unit roots. In this note we prove the converse result: every function λ ↦ v_λ that is the maximum of finitely many rational functions, such that each root of the polynomials in the denominators either lies outside the unit ball in the complex plane or is a unit root with multiplicity 1, is the value function of some Markov decision process.

2 The Model and the Main Theorem

Definition 1 A Markov decision process is a tuple (S, µ, A, r, q) where

- S is a finite set of states.
- µ ∈ Δ(S) is a distribution according to which the initial state is chosen. (For every finite set X, the set of probability distributions over X is denoted Δ(X).)
- A = (A(s))_{s∈S} is the family of sets of actions available at each state s ∈ S. Denote SA := {(s, a) : s ∈ S, a ∈ A(s)}.
- r : SA → R is a payoff function.
- q : SA → Δ(S) is a transition function.

The process starts at an initial state s_1 ∈ S, chosen according to µ. It then evolves in discrete time: at every stage n ∈ N the process is in a state s_n ∈ S, the decision maker chooses an action a_n ∈ A(s_n), and a new state s_{n+1} is chosen according to q(· | s_n, a_n). A finite history is a sequence h_n = (s_1, a_1, s_2, a_2, ..., s_n) ∈ H := ∪_{k≥0} (SA)^k × S. A pure strategy is a function σ : H → ∪_{s∈S} A(s) such that σ(h_n) ∈ A(s_n) for every finite history h_n = (s_1, a_1, ..., s_n), and a behavior strategy is a function σ : H → ∪_{s∈S} Δ(A(s)) such that σ(h_n) ∈ Δ(A(s_n)) for every such finite history. The set of behavior strategies is denoted B. In other words, σ assigns to every finite history a distribution over actions, which we call a mixed action. A strategy is stationary if for every finite history h_n = (s_1, a_1, ..., s_n), the mixed action σ(h_n) is a function of s_n and is independent of (s_1, a_1, ..., a_{n−1}).

Every behavior strategy σ together with a prior distribution µ over the state space induces a probability distribution P_{µ,σ} over the space of infinite histories (SA)^∞ (which is endowed with the product σ-algebra). Expectation w.r.t. this probability distribution is denoted E_{µ,σ}. For every discount factor λ ∈ [0, 1), the λ-discounted payoff is

γ_λ(µ, σ) := E_{µ,σ}[ Σ_{n=1}^∞ λ^{n−1} r(s_n, a_n) ].

When µ is a probability measure that is concentrated on a single state s, we denote the λ-discounted payoff also by γ_λ(s, σ). The λ-discounted value of the Markov decision process with the prior µ over the initial state is

v_λ(µ) := sup_{σ∈B} γ_λ(µ, σ).    (1)

A behavior strategy is λ-discounted optimal if it attains the maximum in (1). Denote by V the set of all functions λ ↦ v_λ(µ) that are the value function of some Markov decision process starting with some prior µ ∈ Δ(S). The goal of the present note is to characterize the set V.

Notation (i) Denote by F the set of all rational functions P/Q such that each root of Q is (a) outside the unit ball, or (b) a unit root with multiplicity 1. (Recall that a complex number ω ∈ C is a unit root if there exists n ∈ N such that ω^n = 1.) (ii) Denote by MaxF the set of functions that are the maximum of a finite number of functions in F.

The next proposition states that any function in V is the maximum of a finite number of functions in F.

Proposition 1 V ⊆ MaxF.

Proof. By Blackwell (1965), for every λ ∈ [0, 1) there is a λ-discounted pure stationary optimal strategy. Since the number of pure stationary strategies is finite, it is sufficient to show that the function λ ↦ γ_λ(µ, σ) is in F for every pure stationary strategy σ.

For every pure stationary strategy σ, every prior µ, and every discount factor λ ∈ [0, 1), the vector (γ_λ(s, σ))_{s∈S} is the unique solution of a system of |S| linear equations:

γ_λ(s, σ) = r(s, σ(s)) + λ Σ_{s'∈S} q(s' | s, σ(s)) γ_λ(s', σ),   ∀ s ∈ S,

where r(s, σ(s)) := Σ_{a∈A(s)} σ(a | s) r(s, a) and q(s' | s, σ(s)) := Σ_{a∈A(s)} σ(a | s) q(s' | s, a) are the multilinear extensions of r and q, respectively. (The system is valid for every stationary strategy, not necessarily pure; we will need it only for pure stationary strategies.) It follows that

γ_λ(·, σ) = (I − λQ(·, σ(·)))^{−1} r(·, σ(·)),

where Q(·, σ(·)) = Q = (Q_{s,s'})_{s,s'∈S} is the transition matrix induced by σ, that is, Q_{s,s'} = q(s' | s, σ(s)). By Cramer's rule, the entries of (I − λQ)^{−1} are rational functions of λ whose common denominator is det(I − λQ). In particular, the roots of the denominator are the inverses of the eigenvalues of Q. Since the denominator is independent of s, it is also the denominator of γ_λ(µ, σ) = Σ_{s∈S} µ(s) γ_λ(s, σ).

Denote the expected payoff at stage n by x_n := E_{µ,σ}[r(s_n, σ(h_n))], so that γ_λ(µ, σ) = Σ_{n=1}^∞ x_n λ^{n−1}. Since |x_n| ≤ max_{s∈S, a∈A(s)} |r(s, a)| for every n ∈ N, it follows that the denominator det(I − λQ(·, σ(·))) does not have roots in the interior of the unit ball, and that all its roots that lie on the boundary of the unit ball have multiplicity 1. Moreover, by, e.g., Higham and Lin (2011), the roots that lie on the boundary of the unit ball must be unit roots.
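To see this direction at work, here is a small symbolic sketch in Python/SymPy; the two-state MDP and its payoffs are invented for illustration and are not taken from the paper. For each pure stationary strategy it solves the linear system above, γ_λ(·, σ) = (I − λQ)^{−1} r, and prints the resulting rational functions of λ, whose pointwise maximum is the value function, as in Proposition 1.

```python
import sympy as sp

lam = sp.symbols('lambda', nonnegative=True)

# A tiny hypothetical MDP (invented numbers): two states; state 0 has two actions, state 1 has one.
# r[s][a] is the payoff and P[s][a] is the transition row q(.|s, a).
r = {0: {0: 1, 1: 4}, 1: {0: 0}}
P = {0: {0: [sp.Rational(1, 2), sp.Rational(1, 2)], 1: [0, 1]},
     1: {0: [1, 0]}}

values = []
for a0 in r[0]:                       # the pure stationary strategies differ only at state 0
    Q = sp.Matrix([P[0][a0], P[1][0]])
    rvec = sp.Matrix([r[0][a0], r[1][0]])
    # gamma_lambda(., sigma) = (I - lambda*Q)^(-1) r, payoffs weighted by lambda**(n-1)
    gamma = (sp.eye(2) - lam * Q).inv() * rvec
    values.append(sp.simplify(gamma[0]))   # discounted payoff at the initial state 0

print(values)            # each entry is a rational function of lambda
print(sp.Max(*values))   # the value function at state 0 is their pointwise maximum
```

In this toy instance the two strategies yield 2/(2 − λ − λ²) and 4/(1 − λ²); the denominator roots are 1, −2 and 1, −1, in line with the root conditions defining F.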

The main result of this note is that the converse holds as well.

Theorem 1 A function w : [0, 1) → R is in V if and only if it is the maximum of a finite number of functions in F.

To avoid cumbersome notation we write f(λ) for the function λ ↦ f(λ). In particular, λf(λ) will denote the function λ ↦ λf(λ). We start with the following observation.

Lemma 1 If f, g ∈ V then max{f, g} ∈ V.

Proof. Let M_f = (S_f, µ_f, A_f, r_f, q_f) and M_g = (S_g, µ_g, A_g, r_g, q_g) be Markov decision processes that implement f and g, respectively. To implement max{f, g}, define a Markov decision process that contains M_f, M_g, and an additional state s*, in which the decision maker chooses one of M_f and M_g. Formally, let M* = (S_f ∪ S_g ∪ {s*}, A*, r*, q*) be a Markov decision process in which A*, r*, and q* coincide with A_f, r_f, and q_f (resp. A_g, r_g, and q_g) on S_f (resp. S_g), and in which at the state s* the decision maker chooses whether to follow M_f or M_g; the payoff and transitions at that state are the expectation of the payoff and transitions at the first stage in M_f (resp. M_g) according to µ_f (resp. µ_g):

A*(s*) := (Π_{s∈S_f} A_f(s)) × (Π_{s∈S_g} A_g(s)) × {f, g},

and for every a_f ∈ Π_{s∈S_f} A_f(s) and every a_g ∈ Π_{s∈S_g} A_g(s),

r*(s*, (a_f, a_g, x)) := Σ_{s∈S_f} µ_f(s) r_f(s, a_f(s)) if x = f, and Σ_{s∈S_g} µ_g(s) r_g(s, a_g(s)) if x = g;
q*(· | s*, (a_f, a_g, x)) := Σ_{s∈S_f} µ_f(s) q_f(· | s, a_f(s)) if x = f, and Σ_{s∈S_g} µ_g(s) q_g(· | s, a_g(s)) if x = g.

The reader can verify that the value function of M* at the initial state s* is max{f, g}.

3 Degenerate Markov Decision Processes

As mentioned earlier, for every pure stationary strategy σ and every initial state s, the function λ ↦ γ_λ(s, σ) is a rational function of λ. Since there is a λ-discounted pure stationary optimal strategy, and since there are finitely many such functions, it follows that the function λ ↦ v_λ is the maximum of finitely many rational functions, each of which is the payoff function of some pure stationary strategy. When the decision maker follows a pure stationary strategy, we are reduced to a Markov decision process in which there is a single action in each state. This observation leads us to the following definition.

A Markov decision process is degenerate if |A(s)| = 1 for every s ∈ S. When M is a degenerate Markov decision process we omit the reference to the action in the functions r and q. A degenerate Markov decision process is thus a quadruple (S, µ, r, q), where S is the state space, µ is a probability distribution over S, r : S → R, and q(· | s) is a probability distribution over S for every state s ∈ S. Denote by V_D the set of all functions that are payoff functions of some degenerate Markov decision process. Plainly, V_D ⊆ V. To prove Theorem 1, we first prove that degenerate Markov decision processes implement all functions in F.

Theorem 2 F ⊆ V_D.

Theorem 2 and Lemma 1 show that MaxF ⊆ V, and together with Proposition 1 they establish Theorem 1. The rest of the paper is dedicated to the proof of Theorem 2.
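Because a degenerate Markov decision process involves no choices, its value function can be computed directly from the linear system used in the proof of Proposition 1. The following SymPy sketch (a hypothetical helper, not part of the paper) returns the value function of a degenerate MDP (S, µ, r, q) as a rational function of λ; the first example, a single state with payoff 1 that loops to itself, gives 1/(1 − λ), a function that reappears in the proof of Lemma 3 below.

```python
import sympy as sp

lam = sp.symbols('lambda')

def degenerate_value(mu, r, Q):
    # Value function of a degenerate MDP (S, mu, r, q), payoffs weighted by lambda**(n-1):
    # it equals mu . (I - lambda*Q)^(-1) r, a rational function of lambda.
    n = len(r)
    gamma = (sp.eye(n) - lam * sp.Matrix(Q)).inv() * sp.Matrix(r)
    return sp.simplify((sp.Matrix([mu]) * gamma)[0])

# A single state with payoff 1 that loops to itself: value 1/(1 - lambda).
print(degenerate_value([1], [1], [[1]]))

# A hypothetical two-state cycle with payoffs (1, 0): value 1/(1 - lambda**2).
print(degenerate_value([1, 0], [1, 0], [[0, 1], [1, 0]]))
```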

3.1 Characterizing the set V_D

The following lemma lists several properties of the functions implementable by degenerate Markov decision processes.

Lemma 2 For every f ∈ V_D we have

a) af(λ) ∈ V_D for every a ∈ R.
b) f(−λ) ∈ V_D.
c) λf(λ) ∈ V_D.
d) f(cλ) ∈ V_D for every c ∈ [0, 1].
e) f(λ) + g(λ) ∈ V_D for every g ∈ V_D.
f) f(λ^n) ∈ V_D for every n ∈ N.

Proof. Let M_f = (S_f, µ_f, r_f, q_f) be a degenerate Markov decision process whose value function is f.

To prove Part (a), we multiply all payoffs in M_f by a. Formally, define a degenerate Markov decision process M* = (S_f, µ_f, r*, q_f) that differs from M_f only in its payoff function: r*(s) := a r_f(s) for every s ∈ S_f. The reader can verify that the value function of M* is af(λ).

To prove Part (b), multiply the payoff in even stages by −1. Formally, let Ŝ be a copy of S_f; for every state s ∈ S_f we denote by ŝ its copy in Ŝ. Define a degenerate Markov decision process M* = (S_f ∪ Ŝ, µ_f, r*, q*) with initial distribution µ_f (whose support is S_f) that visits states in Ŝ in even stages and states in S_f in odd stages, as follows:

r*(s) := r_f(s), r*(ŝ) := −r_f(s), ∀ s ∈ S_f,
q*(ŝ' | s) = q*(s' | ŝ) := q_f(s' | s), ∀ s, s' ∈ S_f,
q*(s' | s) = q*(ŝ' | ŝ) := 0, ∀ s, s' ∈ S_f.

The reader can verify that the value function of M* is f(−λ).

To prove Part (c), add a state with payoff 0 from which the transition probability to a state in S_f coincides with µ_f. Formally, define a degenerate Markov decision process M* = (S_f ∪ {s*}, µ*, r*, q*) in which µ* assigns probability 1 to s*, r* coincides with r_f on S_f while r*(s*) := 0, and q* coincides with q_f on S_f while at the state s*, q*(s' | s*) := µ_f(s'). The value function of M* is λf(λ).

To prove Part (d), consider the transition function that at every stage moves to an absorbing state with payoff 0 with probability 1 − c, and with probability c continues as in M_f. (A state s* ∈ S is absorbing if q(s* | s*, a) = 1 for every action a ∈ A(s*).) Formally, define a degenerate Markov decision process M* = (S_f ∪ {s*}, µ_f, r*, q*) in which r* coincides with r_f on S_f, r*(s*) := 0, q*(s* | s*) := 1 (that is, s* is an absorbing state), and

q*(s* | s) := 1 − c, q*(s' | s) := c q_f(s' | s), ∀ s, s' ∈ S_f.

The value function of M* is f(cλ).

To prove Part (e), we show that (f + g)/2 is in V_D and then use Part (a) with a = 2. The function (f + g)/2 is the value function of the degenerate Markov decision process in which the prior chooses, with probability 1/2 each, one of two degenerate Markov decision processes that implement f and g. Formally, let M_g = (S_g, µ_g, r_g, q_g) be a degenerate Markov decision process whose value function is g. Let M* = (S_f ∪ S_g, µ*, r*, q*) be the degenerate Markov decision process whose state space consists of disjoint copies of S_f and S_g, the functions r* (resp. q*) coincide with r_f and r_g (resp. q_f and q_g) on S_f (resp. S_g), and the initial distribution is µ* = (µ_f + µ_g)/2. The value function of M* is (f + g)/2.

To prove Part (f), we space out the Markov decision process in a way that stage k of the Markov decision process that implements f becomes stage 1 + (k − 1)n, and the payoff in all other stages is 0. Formally, let M* = (S_f × {1, 2, ..., n}, µ*, r*, q*) be a degenerate Markov decision process where µ* assigns probability µ_f(s) to the state (s, 1), and

q*((s, k + 1) | (s, k)) := 1, ∀ k ∈ {1, ..., n − 1}, s ∈ S_f,
q*((s', 1) | (s, n)) := q_f(s' | s), ∀ s, s' ∈ S_f,
r*((s, 1)) := r_f(s), ∀ s ∈ S_f,
r*((s, k)) := 0, ∀ k ∈ {2, 3, ..., n}, s ∈ S_f.

The value function of M* with the prior µ* is f(λ^n).
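The constructions in Parts (c) and (d) can be verified mechanically. The sketch below (SymPy, with an invented two-state M_f; the degenerate_value helper is restated so the block runs on its own) builds the modified processes exactly as described above and checks symbolically that their values are λf(λ) and f(cλ).

```python
import sympy as sp

lam, c = sp.symbols('lambda c')

def degenerate_value(mu, r, Q):
    # value function of a degenerate MDP, payoffs weighted by lambda**(n-1)
    n = len(r)
    gamma = (sp.eye(n) - lam * sp.Matrix(Q)).inv() * sp.Matrix(r)
    return sp.simplify((sp.Matrix([mu]) * gamma)[0])

# A hypothetical two-state M_f (numbers invented for illustration).
mu_f = [1, 0]
r_f = [2, -1]
Q_f = [[sp.Rational(1, 3), sp.Rational(2, 3)], [1, 0]]
f = degenerate_value(mu_f, r_f, Q_f)

# Part (c): a new initial state with payoff 0 that moves into S_f according to mu_f.
part_c = degenerate_value([1, 0, 0], [0] + r_f,
                          [[0] + mu_f, [0] + Q_f[0], [0] + Q_f[1]])
print(sp.simplify(part_c - lam * f))                  # prints 0

# Part (d): at every stage move to an absorbing 0-payoff state with probability 1 - c.
part_d = degenerate_value(mu_f + [0], r_f + [0],
                          [[c * Q_f[0][0], c * Q_f[0][1], 1 - c],
                           [c * Q_f[1][0], c * Q_f[1][1], 1 - c],
                           [0, 0, 1]])
print(sp.simplify(part_d - f.subs(lam, c * lam)))     # prints 0
```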

Lemma 3
a) Every polynomial P is in V_D, and if f ∈ V_D then Pf is also in V_D.
b) Let P and Q be two polynomials. If 1/Q ∈ V_D then P/Q ∈ V_D. In particular, if Q' divides Q and 1/Q ∈ V_D, then 1/Q' ∈ V_D.
c) If Q is a polynomial whose roots are all unit roots of multiplicity 1, then 1/Q ∈ V_D.

(Throughout the paper, whenever we refer to a polynomial we mean a polynomial with real coefficients.)

Proof. Part (a) follows from Lemma 2(a,c,e) and the observation that any constant function a is in V_D, which holds since the constant function a is the value function of the degenerate Markov decision process that starts with a state whose payoff is a and continues to an absorbing state whose payoff is 0. Part (b) follows from Part (a).

We turn to prove Part (c). The degenerate Markov decision process with a single state in which the payoff is 1 yields payoff 1/(1 − λ), and therefore 1/(1 − λ) ∈ V_D. By Lemma 2(f) it follows that 1/(1 − λ^n) ∈ V_D for every n ∈ N. Let n be large enough such that Q divides 1 − λ^n. The result now follows by Part (b).

To complete the proof of Theorem 2 we characterize the polynomials Q that satisfy 1/Q ∈ V_D. To this end we need the following property of V_D.

Lemma 4 If f, g ∈ V_D then f(λ)g(cλ) ∈ V_D for every c ∈ (0, 1).

Proof. The proof of Lemma 4 is the most intricate part of the proof of Theorem 2. We start with an example that will help us illustrate the formal definition of the degenerate MDP that implements f(λ)g(cλ).

Let M_f and M_g be the degenerate Markov decision processes that are depicted in Figure 1, with the initial distributions µ_f(s_{1,f}) = 1 and µ_g(s_{1,g}) = 1, in which the payoff at each state appears in a square next to the state. Denote by f and g the value functions of M_f and M_g, respectively.

[Figure 1: An example of two MDPs, M_f and M_g.]

Consider the degenerate Markov decision process M* depicted in Figure 2, where c ∈ (0, 1) and the initial state is s_{1,g}.

[Figure 2: The degenerate MDP M*.]

The MDP M* is composed of one copy of M_g, and for every state in S_g it contains one copy of M_f. It starts at s_{1,g}, the initial state of M_g. Then, at every stage, with probability c it continues as in M_g, and with probability 1 − c it moves to a copy of M_f. In case a transition to a copy of M_f occurs, the new state is chosen according to the transitions q_f(· | s_{1,f}). This induces a distribution similar to that of the second stage of M_f. The payoff in each of the copies of M_f is the product of the payoff in M_f times the payoff of the state in M_g that has been assigned to that copy. The payoff in each state s_g ∈ S_g is (1 − c) r_g(s_g) times the expected payoff in the first stage of M_f.

Thus, each state s_g ∈ S_g serves three purposes (see Figure 2). First, it is a regular state in the copy of M_g. Second, once a transition from M_g to a copy of M_f occurs (at each stage this occurs with probability 1 − c), it serves as the first stage in M_f. Finally, once a transition from s_g to a copy of M_f occurs, the payoffs in the copy are set to the product of the original payoffs in M_f times r_g(s_g).

We now turn to the formal construction of M*. Let f, g ∈ V_D and let M_f = (S_f, µ_f, r_f, q_f) (resp. M_g = (S_g, µ_g, r_g, q_g)) be the degenerate Markov decision process that implements f (resp. g). Define the following degenerate Markov decision process M* = (S, µ, r, q):

The set of states is S = (S_g × S_f) ∪ S_g. In words, the set of states contains a copy of S_g, and for each state s_g ∈ S_g it contains a copy of S_f.

The initial distribution is µ_g:

µ(s_g) = µ_g(s_g), ∀ s_g ∈ S_g,
µ(s_g, s_f) = 0, ∀ (s_g, s_f) ∈ S_g × S_f.

The transition is as follows. In each copy of S_f, the transition is the same as the transition in M_f:

q((s_g, s'_f) | (s_g, s_f)) := q_f(s'_f | s_f), ∀ s_g ∈ S_g, s_f, s'_f ∈ S_f,
q((s'_g, s'_f) | (s_g, s_f)) := 0, ∀ s'_g ≠ s_g ∈ S_g, s_f, s'_f ∈ S_f.

In the copy of S_g, with probability c the transition is as in M_g, and with probability 1 − c it is as in M_f starting with the initial distribution µ_f:

q(s'_g | s_g) := c q_g(s'_g | s_g), ∀ s_g, s'_g ∈ S_g,
q((s_g, s_f) | s_g) := (1 − c) Σ_{s'_f∈S_f} µ_f(s'_f) q_f(s_f | s'_f), ∀ s_g ∈ S_g, s_f ∈ S_f,
q((s'_g, s_f) | s_g) := 0, ∀ s'_g ≠ s_g ∈ S_g, s_f ∈ S_f.

The payoff function is as follows:

r(s_g) := (1 − c) r_g(s_g) Σ_{s_f∈S_f} µ_f(s_f) r_f(s_f), ∀ s_g ∈ S_g,
r(s_g, s_f) := r_g(s_g) r_f(s_f), ∀ s_g ∈ S_g, s_f ∈ S_f.

We will now calculate the value function of M*. Denote by E_{µ_f}[r_f] := Σ_{s_f∈S_f} µ_f(s_f) r_f(s_f) the expected payoff in M_f at the first stage, and by R the expected payoff in M_f from the second stage on, so that f(λ) = E_{µ_f}[r_f] + λR.

At every stage, with probability c the process remains in S_g and with probability 1 − c the process leaves it. In particular, the probability that at stage n the process is still in S_g is c^{n−1}, in which case (a) the payoff at that stage is (1 − c) r_g(s_n) E_{µ_f}[r_f], and (b) with probability 1 − c the process moves to a copy of M_f, and the total discounted payoff from stage n + 1 on is R. It follows that the total discounted payoff is

Σ_{n=1}^∞ c^{n−1} λ^{n−1} ( (1 − c) r_g(s_n) E_{µ_f}[r_f] + (1 − c) λ r_g(s_n) R )
= (1 − c) Σ_{n=1}^∞ c^{n−1} λ^{n−1} r_g(s_n) f(λ)
= (1 − c) g(cλ) f(λ).

The result follows by Lemma 2(a).
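The construction can also be checked numerically. The following NumPy sketch uses two invented two-state degenerate MDPs (not the Figure 1 example), builds M* exactly as defined above, and verifies that its value equals (1 − c) f(λ) g(cλ) at a few discount factors; Lemma 2(a) then removes the factor 1 − c.

```python
import numpy as np

def value(mu, r, Q, lam):
    # lambda-discounted value of a degenerate MDP, payoffs weighted by lam**(n-1)
    return mu @ np.linalg.solve(np.eye(len(r)) - lam * Q, r)

# Two hypothetical degenerate MDPs (numbers invented for illustration).
r_f = np.array([1.0, 2.0]);  Q_f = np.array([[0.3, 0.7], [0.6, 0.4]]);  mu_f = np.array([1.0, 0.0])
r_g = np.array([3.0, -1.0]); Q_g = np.array([[0.5, 0.5], [0.2, 0.8]]); mu_g = np.array([1.0, 0.0])
c, nf, ng = 0.7, len(r_f), len(r_g)

# Composite MDP of Lemma 4: the states of S_g first, then one copy of S_f per state of S_g.
n = ng + ng * nf
idx = lambda sg, sf: ng + sg * nf + sf           # index of the copy-state (sg, sf)
Q, r, mu = np.zeros((n, n)), np.zeros(n), np.zeros(n)
mu[:ng] = mu_g
for sg in range(ng):
    r[sg] = (1 - c) * r_g[sg] * (mu_f @ r_f)                 # payoff in the S_g part
    Q[sg, :ng] = c * Q_g[sg]                                 # stay in S_g with probability c
    for sf in range(nf):
        Q[sg, idx(sg, sf)] = (1 - c) * (mu_f @ Q_f[:, sf])   # jump into the copy of M_f
        r[idx(sg, sf)] = r_g[sg] * r_f[sf]                   # rescaled payoffs inside the copy
        Q[idx(sg, sf), idx(sg, 0):idx(sg, 0) + nf] = Q_f[sf]  # M_f dynamics inside the copy

for lam in (0.0, 0.4, 0.9):
    lhs = value(mu, r, Q, lam)
    rhs = (1 - c) * value(mu_f, r_f, Q_f, lam) * value(mu_g, r_g, Q_g, c * lam)
    print(lam, lhs, rhs)    # the two numbers agree for every lambda
```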

Lemma 5 Let ω ∈ C be a complex number that lies outside the unit ball. (For every complex number ω, the conjugate of ω is denoted ω̄.)

a) If ω ∈ C \ R then 1/((ω − λ)(ω̄ − λ)) ∈ V_D.
b) If ω ∈ R then 1/(ω − λ) ∈ V_D.

Proof. We start by proving Part (a). For every complex number ω ∈ C \ R that lies outside the unit ball there are three natural numbers k < l < m and three nonnegative reals α_1, α_2, α_3 that sum up to 1 such that

1 = α_1 ω^k + α_2 ω^l + α_3 ω^m.

Consider the degenerate Markov decision process that is depicted in Figure 3. That is, the set of states is S := {s_1, s_2, ..., s_m}, the payoff function is

r(s_m) := 1, r(s_j) := 0, ∀ j < m,

and the transition function is

q(s_{m−k+1} | s_m) := α_1, q(s_{m−l+1} | s_m) := α_2, q(s_1 | s_m) := α_3, q(s_{j+1} | s_j) := 1, ∀ j < m.

[Figure 3: The degenerate MDP in the proof of Lemma 5.]

The discounted value satisfies

v_λ(s_j) = λ v_λ(s_{j+1}), ∀ 1 ≤ j < m,
v_λ(s_m) = 1 + λ ( α_1 v_λ(s_{m−k+1}) + α_2 v_λ(s_{m−l+1}) + α_3 v_λ(s_1) ).

It follows that

v_λ(s_m) = 1 / (1 − α_1 λ^k − α_2 λ^l − α_3 λ^m).

Hence this function is in V_D. Since ω is one of the roots of the denominator (and, because the denominator has real coefficients, so is ω̄), by Lemma 3(b) we obtain that 1/((ω − λ)(ω̄ − λ)) ∈ V_D, as desired.

We turn to prove Part (b). Let ω be a real number with ω > 1. By Lemma 3(c), 1/(1 − λ) is in V_D. By Lemma 2(d), 1/(1 − λ/ω) ∈ V_D, and by Lemma 2(a), 1/(ω − λ) ∈ V_D. When ω < −1, by Lemma 2(a,b), 1/(ω − λ) ∈ V_D.
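As a sanity check on the closed form derived in Part (a), the following SymPy sketch builds the chain of Figure 3 for one arbitrary choice of k < l < m and of nonnegative α_1, α_2, α_3 summing to 1 (these numbers are invented for illustration and are not derived from a particular ω), and confirms that the value at s_m equals 1/(1 − α_1 λ^k − α_2 λ^l − α_3 λ^m).

```python
import sympy as sp

lam = sp.symbols('lambda')

# Invented data: k < l < m and nonnegative alphas summing to 1 (not derived from a specific omega).
k, l, m = 1, 2, 4
a1, a2, a3 = sp.Rational(1, 2), sp.Rational(1, 3), sp.Rational(1, 6)

# The chain of Figure 3: s_1 -> s_2 -> ... -> s_m, and s_m branches back with probabilities a1, a2, a3.
Q = sp.zeros(m, m)
for j in range(m - 1):
    Q[j, j + 1] = 1                  # q(s_{j+1} | s_j) = 1
Q[m - 1, m - k] += a1                # q(s_{m-k+1} | s_m) = a1
Q[m - 1, m - l] += a2                # q(s_{m-l+1} | s_m) = a2
Q[m - 1, 0] += a3                    # q(s_1 | s_m) = a3
r = sp.Matrix([0] * (m - 1) + [1])   # payoff 1 at s_m, zero elsewhere

v = (sp.eye(m) - lam * Q).inv() * r
print(sp.simplify(v[m - 1] - 1 / (1 - a1 * lam**k - a2 * lam**l - a3 * lam**m)))  # prints 0
```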

4 The Proof of Theorem 2

Let Q ≠ 0 be a polynomial with real coefficients whose roots are either outside the unit ball or unit roots with multiplicity 1. To complete the proof of Theorem 2 we prove that 1/Q ∈ V_D.

Denote by Ω_1 the set of all roots of Q that are unit roots, by Ω_2 the set of all roots of Q that lie outside the unit ball and have a positive imaginary part, and by Ω_3 the set of all real roots of Q that lie outside the unit ball. If some roots have multiplicity larger than 1, then they appear several times in Ω_2 or Ω_3. For i = 1, 3 denote Q_i = Π_{ω∈Ω_i} (ω − λ), and set Q_2 = Π_{ω∈Ω_2} (ω − λ)(ω̄ − λ); when Ω_i = ∅ we set Q_i = 1. Then Q = Q_1 Q_2 Q_3 (up to a nonzero multiplicative constant, which is harmless by Lemma 2(a)).

If Ω_1 ≠ ∅, then by Lemma 3(c) we have 1/Q_1 ∈ V_D. Otherwise Q_1 = 1, in which case 1/Q_1 ∈ V_D by Lemma 3(a).

Fix ω ∈ Ω_2 and let c ∈ R be such that 1 < c < |ω|. Since ω/c lies outside the unit ball, Lemma 5(a) implies that g_ω(λ) := 1/((ω/c − λ)(ω̄/c − λ)) is in V_D. By Lemma 4 (applied with the constant 1/c ∈ (0, 1)),

g_ω(λ/c) · (1/Q_1) = c² / ((ω − λ)(ω̄ − λ) Q_1) ∈ V_D.

By Lemma 2(a), 1/((ω − λ)(ω̄ − λ) Q_1) ∈ V_D. Applying this argument successively for the remaining roots in Ω_2, one obtains that 1/(Q_1 Q_2) ∈ V_D.

To complete the proof we apply a similar idea to ω ∈ Ω_3. Fix ω ∈ Ω_3 and let c ∈ R be such that 1 < c < |ω|. By Lemma 5(b), 1/(ω/c − λ) ∈ V_D, and again by Lemma 4 and Lemma 2(a), 1/((ω − λ) Q_1 Q_2) ∈ V_D. By iterating this argument for every ω ∈ Ω_3, one obtains that 1/(Q_1 Q_2 Q_3) ∈ V_D, as desired.

5 Final Remark

The set V contains all value functions of MDPs in which the state at the first stage is chosen according to a probability distribution µ. One can wonder whether the set of implementable value functions shrinks if one restricts attention to MDPs in which the state at the first stage is given; that is, µ assigns probability 1 to one state. The answer is negative: the value function of any MDP in which the initial state is chosen according to a probability distribution (prior) can be obtained as the value function of an MDP in which the initial state is deterministic. Indeed, let M be an MDP with a prior. One can construct M' by adding to M an initial state s* in which the payoff is the expected payoff at the first stage of M and the transitions are the expected transitions after the first stage of M. We applied a similar idea in the proof of Lemma 1.

References

[1] Blackwell D. (1965) Discounted Dynamic Programming. Annals of Mathematical Statistics, 36, 226-235.

[2] Higham N.J. and Lin L. (2011) On pth Roots of Stochastic Matrices. Linear Algebra and its Applications, 435, 448-463.
