Robust Solutions to Markov Decision Problems


Arnab Nilim and Laurent El Ghaoui
Department of Electrical Engineering and Computer Sciences
University of California, Berkeley, CA

Abstract

Optimal solutions to Markov Decision Problems (MDPs) are very sensitive with respect to the state transition probabilities. In many practical problems, the estimation of those probabilities is far from accurate. Hence, estimation errors are limiting factors in applying MDPs to real-world problems. We propose an algorithm for solving finite-state and finite-action MDPs, where the solution is guaranteed to be robust with respect to estimation errors on the state transition probabilities. Our algorithm involves a statistically accurate yet numerically efficient representation of uncertainty via likelihood functions. The worst-case complexity of the robust algorithm is the same as the original Bellman recursion. Hence, robustness can be added at practically no extra computing cost.

Introduction

We consider a finite-state and finite-action Markov decision problem in which the transition probabilities themselves are uncertain, and seek a robust decision for it. Our work is motivated by the fact that in many practical problems, the transition matrices have to be estimated from data. This may be a difficult task, and the estimation errors may have a huge impact on the solution, which is often quite sensitive to changes in the transition probabilities (Feinberg & Shwartz 2002).

A number of authors have addressed the issue of uncertainty in the transition matrices of an MDP. A Bayesian approach such as described by (Shapiro & Kleywegt 2002) requires a perfect knowledge of the whole prior distribution on the transition matrix, making it difficult to apply in practice. Other authors have considered the transition matrix to lie in a given set, most typically a polytope: see (Satia & Lave 1973; White & Eldeib 1994; Givan, Leach, & Dean 1997).
Although our approach allows the uncertainty on the transition matrix to be described by a polytope, we may argue against choosing such a model for the uncertainty. First, a general polytope is often not a tractable way to address the robustness problem, as it incurs a significant additional computational effort to handle uncertainty. Perhaps more importantly, polytopic models, especially interval matrices, may be very poor representations of statistical uncertainty and lead to very conservative robust policies. In (Bagnell, Ng, & Schneider 2001), the authors consider a problem dual to ours, and provide a general statement according to which the cost of solving their problem is polynomial in problem size, provided the uncertainty on the transition matrices is described by convex sets, without proposing any specific algorithm. This paper is a short version of a longer report (Nilim & El-Ghaoui 2004), which contains all the proofs of the results summarized here.

Copyright © 2004, American Association for Artificial Intelligence. All rights reserved.

Notation. $P > 0$ or $P \ge 0$ refers to the strict or nonstrict componentwise inequality for matrices or vectors. For a vector $p > 0$, $\log p$ refers to the componentwise operation. The notation $\mathbf{1}$ refers to the vector of ones, with size determined from context. The probability simplex in $\mathbb{R}^n$ is denoted $\Delta^n = \{p \in \mathbb{R}^n_+ : p^T \mathbf{1} = 1\}$, while $\Theta^n$ is the set of $n \times n$ transition matrices (componentwise non-negative matrices with rows summing to one). We use $\sigma_{\mathcal{P}}$ to denote the support function of a set $\mathcal{P} \subseteq \mathbb{R}^n$, with, for $v \in \mathbb{R}^n$, $\sigma_{\mathcal{P}}(v) := \sup\{p^T v : p \in \mathcal{P}\}$.

The problem description

We consider a finite-horizon Markov decision process with finite decision horizon $T = \{0, 1, 2, \ldots, N-1\}$. At each stage, the system occupies a state $i \in \mathcal{X}$, where $n = |\mathcal{X}|$ is finite, and a decision maker is allowed to choose an action $a$ deterministically from a finite set of allowable actions $\mathcal{A} = \{a_1, \ldots, a_m\}$ (for notational simplicity we assume that $\mathcal{A}$ is not state-dependent). The system starts in a given initial state $i_0$.
The states make Markov transitions according to a collection of (possibly time-dependent) transition matrices $\tau := (P_t^a)_{a \in \mathcal{A}, t \in T}$, where for every $a \in \mathcal{A}$, $t \in T$, the $n \times n$ transition matrix $P_t^a$ contains the probabilities of transition under action $a$ at stage $t$. We denote by $\pi = (a_0, \ldots, a_{N-1})$ a generic controller policy, where $a_t(i)$ denotes the controller action when the system is in state $i \in \mathcal{X}$ at time $t \in T$. Let $\Pi = \mathcal{A}^{nN}$ be the corresponding strategy space. Define by $c_t(i, a)$ the cost corresponding to state $i \in \mathcal{X}$ and action $a \in \mathcal{A}$ at time $t \in T$, and by $c_N$ the cost function at the terminal stage. We assume that $c_t(i, a)$ is non-negative and finite for every $i \in \mathcal{X}$ and $a \in \mathcal{A}$.

For a given set of transition matrices $\tau$, we define the finite-horizon nominal problem by

$$\phi_N(\Pi, \tau) := \min_{\pi \in \Pi} C_N(\pi, \tau), \tag{1}$$

where $C_N(\pi, \tau)$ denotes the expected total cost under controller policy $\pi$ and transitions $\tau$:

$$C_N(\pi, \tau) := \mathbf{E}\left( \sum_{t=0}^{N-1} c_t(i_t, a_t(i_t)) + c_N(i_N) \right). \tag{2}$$

A special case of interest is when the expected total cost function bears the form (2), where the terminal cost is zero, and $c_t(i, a) = \nu^t c(i, a)$, with $c(i, a)$ now a constant cost function, which we assume non-negative and finite everywhere, and $\nu \in (0, 1)$ is a discount factor. We refer to this cost function as the discounted cost function, and denote by $C_\infty(\pi, \tau)$ the limit of the discounted cost (2) as $N \to \infty$.

When the transition matrices are exactly known, the corresponding nominal problem can be solved via a dynamic programming algorithm, which has total complexity of $mn^2N$ flops in the finite-horizon case. In the infinite-horizon case with a discounted cost function, the cost of computing an $\epsilon$-suboptimal policy via the Bellman recursion is $O(mn^2 \log(1/\epsilon))$; see (Puterman 1994) for more details.

The robust control problems

We first assume that, for each action $a$ and time $t$, the corresponding transition matrix $P_t^a$ is only known to lie in some given subset $\mathcal{P}^a$. Two models for transition matrix uncertainty are possible, leading to two possible forms of finite-horizon robust control problems. In a first model, referred to as the stationary uncertainty model, the transition matrices are chosen by nature depending on the controller policy once and for all, and remain fixed thereafter. In a second model, which we refer to as the time-varying uncertainty model, the transition matrices can vary arbitrarily with time, within their prescribed bounds. Each problem leads to a game between the controller and nature, where the controller seeks to minimize the maximum expected cost, with nature being the maximizing player. Let us define our two problems more formally.
A policy of nature refers to a specific collection of time-dependent transition matrices $\tau = (P_t^a)_{a \in \mathcal{A}, t \in T}$ chosen by nature, and the set of admissible policies of nature is $\mathcal{T} := (\otimes_{a \in \mathcal{A}} \mathcal{P}^a)^N$. Denote by $\mathcal{T}_s$ the set of stationary admissible policies of nature:

$$\mathcal{T}_s = \{\tau = (P_t^a)_{a \in \mathcal{A}, t \in T} \in \mathcal{T} : P_t^a = P_s^a \text{ for every } t, s \in T,\ a \in \mathcal{A}\}.$$

The stationary uncertainty model leads to the problem

$$\phi_N(\Pi, \mathcal{T}_s) := \min_{\pi \in \Pi} \max_{\tau \in \mathcal{T}_s} C_N(\pi, \tau). \tag{3}$$

In contrast, the time-varying uncertainty model leads to a relaxed version of the above:

$$\phi_N(\Pi, \mathcal{T}_s) \le \phi_N(\Pi, \mathcal{T}) := \min_{\pi \in \Pi} \max_{\tau \in \mathcal{T}} C_N(\pi, \tau). \tag{4}$$

The first model is attractive for statistical reasons, as it is much easier to develop statistically accurate sets of confidence when the underlying process is time-invariant. Unfortunately, the resulting game (3) seems to be hard to solve, because the principle of optimality may not hold in these problems due to the dependence of the optimal actions of nature among different stages. The second model is attractive as one can solve the corresponding game (4) using a variant of the dynamic programming algorithm seen later, but we are left with a difficult task, that of estimating a meaningful set of confidence for the time-varying matrices $P_t^a$.

In this paper we will use the first model of uncertainty in order to derive statistically meaningful sets of confidence for the transition matrices, based on likelihood or entropy bounds. Then, instead of solving the corresponding difficult control problem (3), we use an approximation that is common in robust control, and solve the time-varying upper bound (4), using the uncertainty sets $\mathcal{P}^a$ derived from a stationarity assumption about the transition matrices.

We will also consider a variant of the finite-horizon time-varying problem (4), where controller and nature play alternatively, leading to a repeated game

$$\phi_N^{\mathrm{rep}}(\Pi, \mathcal{Q}) := \min_{a_0} \max_{\tau_0 \in \mathcal{Q}} \min_{a_1} \max_{\tau_1 \in \mathcal{Q}} \cdots \min_{a_{N-1}} \max_{\tau_{N-1} \in \mathcal{Q}} C_N(\pi, \tau), \tag{5}$$

where the notation $\tau_t = (P_t^a)_{a \in \mathcal{A}}$ denotes the collection of transition matrices at a given time $t \in T$, and $\mathcal{Q} := \otimes_{a \in \mathcal{A}} \mathcal{P}^a$ is the corresponding set of confidence.

Finally, we will consider an infinite-horizon robust control problem, with the discounted cost function referred to above, and where we restrict control and nature policies to be stationary:

$$\phi_\infty(\Pi_s, \mathcal{T}_s) := \min_{\pi \in \Pi_s} \max_{\tau \in \mathcal{T}_s} C_\infty(\pi, \tau), \tag{6}$$

where $\Pi_s$ denotes the space of stationary control policies. We define $\phi_\infty(\Pi, \mathcal{T})$, $\phi_\infty(\Pi, \mathcal{T}_s)$ and $\phi_\infty(\Pi_s, \mathcal{T})$ accordingly.

In the sequel, for a given control policy $\pi \in \Pi$ and subset $\mathcal{S} \subseteq \mathcal{T}$, the notation $\phi_N(\pi, \mathcal{S}) := \max_{\tau \in \mathcal{S}} C_N(\pi, \tau)$ denotes the worst-case expected total cost for the finite-horizon problem, and $\phi_\infty(\pi, \mathcal{S})$ is defined likewise.

Main results

Our main contributions are as follows. First, we provide a recursion, the robust dynamic programming algorithm, which solves the finite-horizon robust control problem (4). We provide a simple proof in (Nilim & El-Ghaoui 2004) of the optimality of the recursion, where the main ingredient is to show that perfect duality holds in the game (4). As a corollary of this result, we obtain that the repeated game (5) is equivalent to its non-repeated counterpart (4). Second, we provide similar results for the infinite-horizon problem with discounted cost function, (6). Moreover, we obtain that if we consider a finite-horizon problem with a discounted cost function, then the gap between the optimal value of the stationary uncertainty problem (3) and that of its time-varying counterpart (4) goes to zero as the horizon length goes to infinity, at a rate determined by the discount factor. Finally, we identify several classes of uncertainty models, which result in an algorithm that is both statistically accurate and numerically tractable. We provide precise complexity results that imply that, with the proposed approach, robustness can be handled at practically no extra computing cost.

Finite-Horizon robust MDP

We consider the finite-horizon robust control problem defined above. For a given state $i \in \mathcal{X}$, action $a \in \mathcal{A}$, and $P^a \in \mathcal{P}^a$, we denote by $p_i^a$ the next-state distribution drawn from $P^a$ corresponding to state $i \in \mathcal{X}$; thus $p_i^a$ is the $i$-th row of matrix $P^a$. We define $\mathcal{P}_i^a$ as the projection of the set $\mathcal{P}^a$ onto the set of $p_i^a$-variables. By assumption, these sets are included in the probability simplex of $\mathbb{R}^n$, $\Delta^n$; no other property is assumed. The following theorem is proved in (Nilim & El-Ghaoui 2004).

Theorem 1 (robust dynamic programming) For the robust control problem (4), perfect duality holds:

$$\phi_N(\Pi, \mathcal{T}) = \min_{\pi \in \Pi} \max_{\tau \in \mathcal{T}} C_N(\pi, \tau) = \max_{\tau \in \mathcal{T}} \min_{\pi \in \Pi} C_N(\pi, \tau) =: \psi_N(\Pi, \mathcal{T}).$$

The problem can be solved via the recursion

$$v_t(i) = \min_{a \in \mathcal{A}} \left( c_t(i, a) + \sigma_{\mathcal{P}_i^a}(v_{t+1}) \right), \quad i \in \mathcal{X},\ t \in T, \tag{7}$$

where $\sigma_{\mathcal{P}}(v) := \sup\{p^T v : p \in \mathcal{P}\}$ denotes the support function of a set $\mathcal{P}$, and $v_t(i)$ is the worst-case optimal value function in state $i$ at stage $t$. A corresponding optimal control policy $\pi^* = (a_0^*, \ldots, a_{N-1}^*)$ is obtained by setting

$$a_t^*(i) \in \arg\min_{a \in \mathcal{A}} \left\{ c_t(i, a) + \sigma_{\mathcal{P}_i^a}(v_{t+1}) \right\}, \quad i \in \mathcal{X}. \tag{8}$$

The effect of uncertainty on a given strategy $\pi = (a_0, \ldots, a_{N-1})$ can be evaluated by the following recursion

$$v_t^\pi(i) = c_t(i, a_t(i)) + \sigma_{\mathcal{P}_i^{a_t(i)}}(v_{t+1}^\pi), \quad i \in \mathcal{X}, \tag{9}$$

which provides the worst-case value function $v^\pi$ for the strategy $\pi$.

The above result has a nice consequence for the repeated game (5):

Corollary 2 The repeated game (5) is equivalent to the game (4):

$$\phi_N^{\mathrm{rep}}(\Pi, \mathcal{Q}) = \phi_N(\Pi, \mathcal{T}),$$

and the optimal strategies for $\phi_N(\Pi, \mathcal{T})$ given in Theorem 1 are optimal for $\phi_N^{\mathrm{rep}}(\Pi, \mathcal{Q})$ as well.

The interpretation of the perfect duality result given in Theorem 1, and its consequence given in Corollary 2, is that it does not matter whether the controller or nature plays first, or whether they play alternatively; all these games are equivalent.
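The recursion (7) and the policy rule (8) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the bisection-based algorithm of the paper: each set $\mathcal{P}_i^a$ is taken to be a finite list of candidate rows, so the support function reduces to a plain maximum, and the stage cost is taken to be time-invariant. The names `support_finite`, `robust_finite_horizon_dp`, `scen`, and the two-state example are hypothetical.

```python
def support_finite(scenarios, v):
    """Support function of a finite set of rows: max over p of p^T v."""
    return max(sum(pj * vj for pj, vj in zip(p, v)) for p in scenarios)


def robust_finite_horizon_dp(scen, cost, terminal_cost, horizon):
    """Backward recursion (7) with finite scenario sets (an assumption).

    scen[a][i]    -- list of candidate next-state distributions for (i, a)
    cost[i][a]    -- stage cost, assumed time-invariant in this sketch
    terminal_cost -- terminal cost c_N
    Returns the worst-case value function at stage 0 and the policy,
    where policy[t][i] is the action chosen by rule (8).
    """
    n, m = len(terminal_cost), len(scen)
    v = [float(x) for x in terminal_cost]
    policy = []
    for _ in range(horizon):
        new_v, actions = [], []
        for i in range(n):
            # Worst-case cost-to-go of each action in state i.
            q = [cost[i][a] + support_finite(scen[a][i], v) for a in range(m)]
            best = min(range(m), key=lambda a: q[a])
            new_v.append(q[best])
            actions.append(best)
        v = new_v
        policy.insert(0, actions)
    return v, policy


# Hypothetical example: state 0 costs 1 per stage, state 1 costs 0.
# Action 0 stays put; action 1 tries to switch state, but from state 0
# nature may make the switch succeed only half of the time.
scen = [
    [[[1.0, 0.0]], [[0.0, 1.0]]],                          # action 0
    [[[0.0, 1.0], [0.5, 0.5]], [[1.0, 0.0]]],              # action 1
]
cost = [[1, 1], [0, 0]]
v0, policy = robust_finite_horizon_dp(scen, cost, [0, 0], horizon=2)
```

In this example the worst-case value in state 0 is 1.5, against 1.0 when the switch is assumed to succeed with certainty, which shows how the inner maximum enters the recursion.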
Now consider the following algorithm, where the uncertainty is described in terms of one of the models described in the section "Kullback-Leibler Divergence Uncertainty Models":

Robust Finite Horizon Dynamic Programming Algorithm

1. Set $\epsilon > 0$. Initialize the value function to its terminal value $\hat{v}_N = c_N$.
2. Repeat until $t = 0$:
(a) For every state $i \in \mathcal{X}$ and action $a \in \mathcal{A}$, compute, using the bisection algorithm given in (Nilim & El-Ghaoui 2004), a value $\hat{\sigma}_i^a$ such that $\hat{\sigma}_i^a - \epsilon/N \le \sigma_{\mathcal{P}_i^a}(\hat{v}_t) \le \hat{\sigma}_i^a$.
(b) Update the value function by $\hat{v}_{t-1}(i) = \min_{a \in \mathcal{A}} \left( c_{t-1}(i, a) + \hat{\sigma}_i^a \right)$, $i \in \mathcal{X}$.
(c) Replace $t$ by $t - 1$ and go to 2.
3. For every $i \in \mathcal{X}$ and $t \in T$, set $\pi^\epsilon = (a_0^\epsilon, \ldots, a_{N-1}^\epsilon)$, where $a_{t-1}^\epsilon(i) = \arg\min_{a \in \mathcal{A}} \{ c_{t-1}(i, a) + \hat{\sigma}_i^a \}$, $i \in \mathcal{X}$, with $\hat{\sigma}_i^a$ the values computed in step 2(a) at stage $t$.

As shown in (Nilim & El-Ghaoui 2004), the above algorithm provides a suboptimal policy $\pi^\epsilon$ that achieves the exact optimum with prescribed accuracy $\epsilon$, with a required number of flops bounded above by $O(mn^2 N \log(N/\epsilon))$. This means that robustness is obtained at a relative increase of computational cost of only $\log(N/\epsilon)$ with respect to the classical dynamic programming algorithm, which is small for moderate values of $N$. If $N$ is very large, we can turn instead to the infinite-horizon problem examined in the following section, and similar complexity results hold.

Infinite-horizon MDP

In this section, we address the infinite-horizon robust control problem, with a discounted cost function of the form (2), where the terminal cost is zero, and $c_t(i, a) = \nu^t c(i, a)$, where $c(i, a)$ is now a constant cost function, which we assume non-negative and finite everywhere, and $\nu \in (0, 1)$ is a discount factor.

We begin with the infinite-horizon problem involving stationary control and nature policies defined in (6). The following theorem is proved in (Nilim & El-Ghaoui 2004).
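Theorem 3 (Robust Bellman recursion) For the infinite-horizon robust control problem (6) with stationary uncertainty on the transition matrices, stationary control policies, and a discounted cost function with discount factor $\nu \in [0, 1)$, perfect duality holds:

$$\phi_\infty(\Pi_s, \mathcal{T}_s) = \max_{\tau \in \mathcal{T}_s} \min_{\pi \in \Pi_s} C_\infty(\pi, \tau) =: \psi_\infty(\Pi_s, \mathcal{T}_s). \tag{10}$$

The optimal value is given by $\phi_\infty(\Pi_s, \mathcal{T}_s) = v(i_0)$, where $i_0$ is the initial state, and where the value function $v$ satisfies the optimality conditions

$$v(i) = \min_{a \in \mathcal{A}} \left( c(i, a) + \nu \sigma_{\mathcal{P}_i^a}(v) \right), \quad i \in \mathcal{X}. \tag{11}$$

The value function is the unique limit value of the convergent vector sequence defined by

$$v_{k+1}(i) = \min_{a \in \mathcal{A}} \left( c(i, a) + \nu \sigma_{\mathcal{P}_i^a}(v_k) \right), \quad i \in \mathcal{X},\ k = 1, 2, \ldots \tag{12}$$

A stationary, optimal control policy $\pi^* = (a^*, a^*, \ldots)$ is obtained as

$$a^*(i) \in \arg\min_{a \in \mathcal{A}} \left\{ c(i, a) + \nu \sigma_{\mathcal{P}_i^a}(v) \right\}, \quad i \in \mathcal{X}. \tag{13}$$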

Note that the problem of computing the dual quantity $\psi_\infty(\Pi_s, \mathcal{T}_s)$ given in (10) has been addressed in (Bagnell, Ng, & Schneider 2001), where the authors provide the recursion (12) without proof.

Theorem 3 leads to the following corollary, also proved in (Nilim & El-Ghaoui 2004).

Corollary 4 In the infinite-horizon problem, we can without loss of generality assume that the control and nature policies are stationary, that is,

$$\phi_\infty(\Pi, \mathcal{T}) = \phi_\infty(\Pi_s, \mathcal{T}_s) = \phi_\infty(\Pi_s, \mathcal{T}) = \phi_\infty(\Pi, \mathcal{T}_s). \tag{14}$$

Furthermore, in the finite-horizon case, with a discounted cost function, the gap between the optimal values of the finite-horizon problems under stationary and time-varying uncertainty models, $\phi_N(\Pi, \mathcal{T}) - \phi_N(\Pi, \mathcal{T}_s)$, goes to zero as the horizon length $N$ goes to infinity, at a geometric rate $\nu$.

Now consider the following algorithm, where we describe the uncertainty using one of the models of the section "Kullback-Leibler Divergence Uncertainty Models".

Robust Infinite Horizon Dynamic Programming Algorithm

1. Set $\epsilon > 0$, initialize the value function $\hat{v}_1 > 0$ and set $k = 1$.
2. (a) For all states $i$ and controls $a$, compute, using the bisection algorithm given in (Nilim & El-Ghaoui 2004), a value $\hat{\sigma}_i^a$ such that $\hat{\sigma}_i^a - \delta \le \sigma_{\mathcal{P}_i^a}(\hat{v}_k) \le \hat{\sigma}_i^a$, where $\delta = (1 - \nu)\epsilon / (2\nu)$.
(b) For all states $i$ and controls $a$, compute $\hat{v}_{k+1}(i)$ by $\hat{v}_{k+1}(i) = \min_{a \in \mathcal{A}} \left( c(i, a) + \nu \hat{\sigma}_i^a \right)$.
3. If $\|\hat{v}_{k+1} - \hat{v}_k\| < (1 - \nu)\epsilon / (2\nu)$, go to 4. Otherwise, replace $k$ by $k + 1$ and go to 2.
4. For each $i \in \mathcal{X}$, set $\pi^\epsilon = (a^\epsilon, a^\epsilon, \ldots)$, where $a^\epsilon(i) = \arg\min_{a \in \mathcal{A}} \{ c(i, a) + \nu \hat{\sigma}_i^a \}$, $i \in \mathcal{X}$.

In (Nilim & El-Ghaoui 2003; 2004), we establish that the above algorithm finds an $\epsilon$-suboptimal robust policy in at most $O(mn^2 \log(1/\epsilon)^2)$ flops. Thus, the extra computational cost incurred by robustness in the infinite-horizon case is only $O(\log(1/\epsilon))$.
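The steps of the infinite-horizon algorithm can be sketched as follows. This is a minimal illustration under assumptions that are not in the paper: the uncertainty sets are finite lists of candidate rows (so the inner problem is solved exactly by a maximum, with no bisection), the norm in step 3 is taken to be the max norm, and the iteration starts from the zero vector. The function name `robust_value_iteration` and the example data are hypothetical.

```python
def robust_value_iteration(scen, cost, nu, eps):
    """Robust Bellman recursion (12) with the stopping rule of step 3.

    scen[a][i] -- finite list of candidate next-state distributions
                  for state i and action a (an assumption of this sketch)
    cost[i][a] -- stationary stage cost c(i, a)
    nu         -- discount factor in (0, 1)
    eps        -- target accuracy
    """
    n, m = len(cost), len(scen)
    v = [0.0] * n                           # starting point (assumption)
    threshold = (1.0 - nu) * eps / (2.0 * nu)
    while True:
        # q[i][a] = c(i, a) + nu * worst-case expected next value.
        q = [[cost[i][a] + nu * max(sum(pj * vj for pj, vj in zip(p, v))
                                    for p in scen[a][i])
              for a in range(m)]
             for i in range(n)]
        new_v = [min(row) for row in q]
        gap = max(abs(x - y) for x, y in zip(new_v, v))   # max norm
        v = new_v
        if gap < threshold:
            break
    # Step 4: greedy action with respect to the last worst-case values.
    policy = [row.index(min(row)) for row in q]
    return v, policy


# Hypothetical example: state 0 costs 1 per stage, state 1 costs 0.
# Action 0 stays put; action 1 tries to switch state, but from state 0
# nature may make the switch succeed only half of the time.
scen = [
    [[[1.0, 0.0]], [[0.0, 1.0]]],                          # action 0
    [[[0.0, 1.0], [0.5, 0.5]], [[1.0, 0.0]]],              # action 1
]
cost = [[1, 1], [0, 0]]
v_inf, policy_inf = robust_value_iteration(scen, cost, nu=0.5, eps=1e-8)
```

For this example the fixed point of (11) can be checked by hand: in state 0 the best action is 1, with $v(0) = 1 + 0.5 \cdot 0.5\, v(0)$, so $v(0) = 4/3$ and $v(1) = 0$.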
Solving the inner problem

Each step of the robust dynamic programming algorithm involves the solution of an optimization problem, referred to as the inner problem, of the form

$$\sigma_{\mathcal{P}}(v) = \max_{p \in \mathcal{P}} v^T p, \tag{15}$$

where the variable $p$ corresponds to a particular row of a specific transition matrix, $\mathcal{P} = \mathcal{P}_i^a$ is the set that describes the uncertainty on this row, and $v$ contains the elements of the value function at some given stage. The complexity of the sets $\mathcal{P}_i^a$ for each $i \in \mathcal{X}$ and $a \in \mathcal{A}$ is a key component in the complexity of the robust dynamic programming algorithm. Note that we can safely replace $\mathcal{P}$ in (15) by its convex hull, so that convexity of the sets $\mathcal{P}_i^a$ is not required; the algorithm only requires the knowledge of their convex hulls.

Beyond numerical tractability, an additional criterion for the choice of a specific uncertainty model is that the sets $\mathcal{P}^a$ should represent accurate (non-conservative) descriptions of the statistical uncertainty on the transition matrices. Perhaps surprisingly, there are statistical models of uncertainty described via Kullback-Leibler divergence that are good on both counts; specific examples of such models are described in the following sections.

Kullback-Leibler Divergence Uncertainty Models

We now address the inner problem (15) for a specific action $a \in \mathcal{A}$ and state $i \in \mathcal{X}$. Denote by $D(p \| q)$ the Kullback-Leibler (KL) divergence (relative entropy) from the probability distribution $q \in \Delta^n$ to the probability distribution $p \in \Delta^n$:

$$D(p \| q) := \sum_j p(j) \log \frac{p(j)}{q(j)}.$$

The above function provides a natural way to describe errors in (rows of) the transition matrices; examples of models based on this function are given below.

Likelihood Models: Our first uncertainty model is derived from a controlled experiment starting from state $i = 1, 2, \ldots, n$ and the count of the number of transitions to different states. We denote by $F^a$ the matrix of empirical frequencies of transition with control $a$ in the experiment; denote by $f_i^a$ its $i$-th row. We have $F^a \ge 0$ and $F^a \mathbf{1} = \mathbf{1}$, where $\mathbf{1}$ denotes the vector of ones.
The "plug-in" estimate $\hat{P}^a = F^a$ is the solution to the maximum likelihood problem

$$\max_P \sum_{i,j} F^a(i, j) \log P(i, j) \ : \ P \ge 0,\ P\mathbf{1} = \mathbf{1}. \tag{16}$$

The optimal log-likelihood is $\beta_{\max}^a = \sum_{i,j} F^a(i, j) \log F^a(i, j)$. A classical description of uncertainty in a maximum-likelihood setting is via the "likelihood region" (Lehmann & Casella 1998)

$$\mathcal{P}^a = \Big\{ P \in \mathbb{R}^{n \times n} : P \ge 0,\ P\mathbf{1} = \mathbf{1},\ \sum_{i,j} F^a(i, j) \log P(i, j) \ge \beta^a \Big\},$$

where $\beta^a < \beta_{\max}^a$ is a pre-specified number, which represents the uncertainty level. In practice, the designer specifies an uncertainty level $\beta^a$ based on re-sampling methods, or on a large-sample Gaussian approximation, so as to ensure that the set above achieves a desired level of confidence.

With the above model, we note that the inner problem (15) only involves the set

$$\mathcal{P}_i^a := \Big\{ p_i^a \in \mathbb{R}^n : p_i^a \ge 0,\ (p_i^a)^T \mathbf{1} = 1,\ \sum_j F^a(i, j) \log p_i^a(j) \ge \beta_i^a \Big\},$$

where $\beta_i^a := \beta^a - \sum_{k \ne i} \sum_j F^a(k, j) \log F^a(k, j)$. The set $\mathcal{P}_i^a$ is the projection of the set described above on a specific axis of $p_i^a$-variables. Noting further that the likelihood function can be expressed in terms of KL divergence, the corresponding uncertainty model on the row $p_i^a$ for given $i \in \mathcal{X}$, $a \in \mathcal{A}$, is given by a set of the form $\mathcal{P}_i^a = \{ p \in \Delta^n : D(f_i^a \| p) \le \gamma_i^a \}$, where $\gamma_i^a = \sum_j F^a(i, j) \log F^a(i, j) - \beta_i^a$ is a function of the uncertainty level, and $f_i^a$ is the $i$-th row of the matrix $F^a$.

Maximum A-Posteriori (MAP) Models: A variation on Likelihood models involves Maximum A Posteriori (MAP) estimates. If there exists prior information regarding the uncertainty on the $i$-th row of $P^a$, which can be described via a Dirichlet distribution (Ferguson 1974) with parameter $\alpha_i^a$, the resulting MAP estimation problem takes the form

$$\max_p \ (f_i^a + \alpha_i^a - \mathbf{1})^T \log p \ : \ p^T \mathbf{1} = 1,\ p \ge 0.$$

Thus, the MAP uncertainty model is equivalent to a Likelihood model, with the sample distribution $f_i^a$ replaced by $f_i^a + \alpha_i^a - \mathbf{1}$, where $\alpha_i^a$ is the prior corresponding to state $i$ and action $a$.

Relative Entropy Models: Likelihood or MAP models involve the KL divergence from the unknown distribution to a reference distribution. We can also choose to describe uncertainty by exchanging the order of the arguments of the KL divergence. This results in a so-called "relative entropy" model, where the uncertainty on the $i$-th row of the transition matrix $P^a$ is described by a set of the form $\mathcal{P}_i^a = \{ p \in \Delta^n : D(p \| q_i^a) \le \gamma_i^a \}$, where $\gamma_i^a > 0$ is fixed, and $q_i^a > 0$ is a given "reference" distribution (for example, the Maximum Likelihood distribution).

Inner problem with Likelihood Models

The inner problem (15) with the Likelihood uncertainty model is the following:

$$\sigma^* := \max_p \ p^T v \ : \ p \in \Delta^n,\ \sum_j f(j) \log p(j) \ge \beta, \tag{17}$$

where we have dropped the subscript $i$ and superscript $a$ in the empirical frequencies vector $f_i^a$ and in the lower bound $\beta_i^a$. In this section $\beta_{\max}$ denotes the maximal value of the likelihood function appearing in the above set, which is $\beta_{\max} = \sum_j f(j) \log f(j)$.
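We assume that $\beta < \beta_{\max}$, which, together with $f > 0$, ensures that the set above has non-empty interior. Without loss of generality, we can assume that $v \in \mathbb{R}^n_+$.

The dual problem

The Lagrangian $L : \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ associated with the inner problem can be written as

$$L(v, \zeta, \mu, \lambda) = p^T v + \zeta^T p + \mu(1 - p^T \mathbf{1}) + \lambda(f^T \log p - \beta),$$

where $\zeta$, $\mu$, and $\lambda$ are the Lagrange multipliers. The Lagrange dual function $d : \mathbb{R}^n \times \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is the maximum value of the Lagrangian over $p$, i.e., for $\zeta \in \mathbb{R}^n$, $\mu \in \mathbb{R}$, and $\lambda \in \mathbb{R}$,

$$d(\zeta, \mu, \lambda) = \sup_p L(v, \zeta, \mu, \lambda) = \sup_p \left( p^T v + \zeta^T p + \mu(1 - p^T \mathbf{1}) + \lambda(f^T \log p - \beta) \right). \tag{18}$$

The optimal $p^* = \arg\sup_p L(v, \zeta, \mu, \lambda)$ is readily obtained by solving $\partial L / \partial p = 0$, which results in

$$p^*(i) = \frac{\lambda f(i)}{\mu - v(i) - \zeta(i)}.$$

Plugging the value of $p^*$ in the equation for $d(\zeta, \mu, \lambda)$ yields, with some simplification, the following dual problem:

$$\bar{\sigma} := \min_{\lambda, \mu, \zeta} \ \mu - (1 + \beta)\lambda + \lambda \sum_j f(j) \log \frac{\lambda f(j)}{\mu - v(j) - \zeta(j)}, \quad \text{such that } \lambda \ge 0,\ \zeta \ge 0,\ \zeta + v \le \mu \mathbf{1}.$$

Since the above problem is convex, and has a feasible set with non-empty interior, there is no duality gap, that is, $\sigma^* = \bar{\sigma}$. Moreover, by a monotonicity argument, we obtain that the optimal dual variable $\zeta$ is zero, which reduces the number of variables to two:

$$\sigma^* = \min_{\lambda, \mu} h(\lambda, \mu),$$

where

$$h(\lambda, \mu) := \begin{cases} \mu - (1 + \beta)\lambda + \lambda \displaystyle\sum_j f(j) \log \frac{\lambda f(j)}{\mu - v(j)} & \text{if } \lambda > 0,\ \mu > v_{\max}, \\ +\infty & \text{otherwise.} \end{cases} \tag{19}$$

(Note that $v_{\max} := \max_j v(j)$.) For further reference, we note that $h$ is twice differentiable on its domain, and that its gradient is given by

$$\nabla h(\lambda, \mu) = \begin{pmatrix} \displaystyle\sum_j f(j) \log \frac{\lambda f(j)}{\mu - v(j)} - \beta \\ 1 - \lambda \displaystyle\sum_j \frac{f(j)}{\mu - v(j)} \end{pmatrix}. \tag{20}$$

A bisection algorithm

From the expression of the gradient obtained above, we obtain that the optimal value of $\lambda$ for a fixed $\mu$, $\lambda(\mu)$, is given analytically by

$$\lambda(\mu) = \left( \sum_j \frac{f(j)}{\mu - v(j)} \right)^{-1}, \tag{21}$$

which further reduces the problem to a one-dimensional problem:

$$\sigma^* = \min_{\mu \ge v_{\max}} \sigma(\mu),$$

where $v_{\max} = \max_j v(j)$, and $\sigma(\mu) = h(\lambda(\mu), \mu)$. By construction, the function $\sigma(\mu)$ is convex in its (scalar) argument, since the function $h$ defined in (19) is jointly convex in both its arguments (see (Boyd & Vandenberghe 2004, p. 74)). Hence, we may use bisection to minimize $\sigma$.

To initialize the bisection algorithm, we need upper and lower bounds $\mu_-$ and $\mu_+$ on a minimizer of $\sigma$. When $\mu \to v_{\max}$, $\sigma(\mu) \to v_{\max}$ and $\sigma'(\mu) \to -\infty$ (see (Nilim & El-Ghaoui 2004)). Thus, we may set the lower bound to $\mu_- = v_{\max}$.

The upper bound $\mu_+$ must be chosen such that $\sigma'(\mu_+) > 0$. We have

$$\sigma'(\mu) = \frac{\partial h}{\partial \mu}(\lambda(\mu), \mu) + \frac{\partial h}{\partial \lambda}(\lambda(\mu), \mu) \frac{d\lambda(\mu)}{d\mu}. \tag{22}$$

The first term is zero by construction, and $d\lambda(\mu)/d\mu > 0$ for $\mu > v_{\max}$. Hence, we only need a value of $\mu$ for which

$$\frac{\partial h}{\partial \lambda}(\lambda(\mu), \mu) = \sum_j f(j) \log \frac{\lambda(\mu) f(j)}{\mu - v(j)} - \beta > 0. \tag{23}$$

By convexity of the negative log function, and using the fact that $f^T \mathbf{1} = 1$, $f \ge 0$, we obtain that

$$\frac{\partial h}{\partial \lambda}(\lambda(\mu), \mu) = \beta_{\max} - \beta + \sum_j f(j) \log \frac{\lambda(\mu)}{\mu - v(j)} \ \ge\ \beta_{\max} - \beta - \log \left( \sum_j \frac{f(j)(\mu - v(j))}{\lambda(\mu)} \right) = \beta_{\max} - \beta + \log \frac{\lambda(\mu)}{\mu - \bar{v}},$$

where $\bar{v} = f^T v$ denotes the average of $v$ under $f$. The above, combined with the bound on $\lambda(\mu)$, $\lambda(\mu) \ge \mu - v_{\max}$, yields a sufficient condition for (23) to hold:

$$\mu > \mu_+^0 := \frac{v_{\max} - e^{\beta - \beta_{\max}} \bar{v}}{1 - e^{\beta - \beta_{\max}}}. \tag{24}$$

By construction, the interval $[v_{\max}, \mu_+^0]$ is guaranteed to contain a global minimizer of $\sigma$ over $(v_{\max}, +\infty)$.

The bisection algorithm goes as follows:

1. Set $\mu_- = v_{\max}$ and $\mu_+ = \mu_+^0$ as in (24). Let $\delta > 0$ be a small convergence parameter.
2. While $\mu_+ - \mu_- > \delta(1 + \mu_+ + \mu_-)$, repeat
(a) Set $\mu = (\mu_+ + \mu_-)/2$.
(b) Compute the gradient of $\sigma$ at $\mu$.
(c) If $\sigma'(\mu) > 0$, set $\mu_+ = \mu$; otherwise, set $\mu_- = \mu$.
(d) Go to 2a.

In practice, the function to minimize may be very flat near the minimum. This means that the above bisection algorithm may take a long time to converge to the global minimizer. Since we are only interested in the value of the minimum (and not of the minimizer), we may modify the stopping criterion to

$$\mu_+ - \mu_- \le \delta(1 + \mu_+ + \mu_-) \quad \text{or} \quad \sigma'(\mu_+) - \sigma'(\mu_-) \le \delta.$$

The second condition in the criterion implies that $|\sigma'((\mu_+ + \mu_-)/2)| \le \delta$, which is an approximate condition for global optimality.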
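The bisection on $\mu$ can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: it assumes $f > 0$ and $\beta < \beta_{\max}$ as in the text, uses the bracket $[v_{\max}, \mu_+^0]$ from (24), and bisects on the sign of the expression in (23), which by (22) is the sign of $\sigma'(\mu)$. Instead of evaluating $h$, it rebuilds the row $p(j) = \lambda(\mu) f(j) / (\mu - v(j))$ and returns $p^T v$. The name `likelihood_inner` and the test data are hypothetical.

```python
import math


def likelihood_inner(v, f, beta, delta=1e-10):
    """Approximate max p^T v over {p in simplex : sum_j f_j log p_j >= beta}.

    Assumes f > 0 componentwise and beta < sum_j f_j log f_j.
    Returns the approximate optimal value and the corresponding row p.
    """
    v_max = max(v)
    if min(v) == v_max:            # constant v: every row gives v_max
        return v_max, list(f)
    beta_max = sum(fj * math.log(fj) for fj in f)
    v_bar = sum(fj * vj for fj, vj in zip(f, v))
    r = math.exp(beta - beta_max)
    mu_lo, mu_hi = v_max, (v_max - r * v_bar) / (1.0 - r)   # bound (24)

    def lam(mu):                   # lambda(mu) as in (21)
        return 1.0 / sum(fj / (mu - vj) for fj, vj in zip(f, v))

    def slope_sign(mu):            # left-hand side of (23)
        l = lam(mu)
        return sum(fj * math.log(l * fj / (mu - vj))
                   for fj, vj in zip(f, v)) - beta

    while mu_hi - mu_lo > delta * (1.0 + mu_hi + mu_lo):
        mu = 0.5 * (mu_hi + mu_lo)
        if slope_sign(mu) > 0:
            mu_hi = mu
        else:
            mu_lo = mu

    mu = mu_hi                     # strictly above v_max by construction
    l = lam(mu)
    p = [l * fj / (mu - vj) for fj, vj in zip(f, v)]
    return sum(pj * vj for pj, vj in zip(p, v)), p


# Two-state check: with f = (0.5, 0.5) and beta = log(0.4), the constraint
# reads p1 * p2 >= 0.16, so the worst-case row for v = (0, 1) is (0.2, 0.8).
sigma_lik, p_lik = likelihood_inner([0.0, 1.0], [0.5, 0.5], math.log(0.4))
```

Returning $p^T v$ for the recovered row, rather than $h(\lambda(\mu), \mu)$, is a design choice of this sketch: taking $\mu$ at the upper end of the final bracket keeps (23) non-negative, so the row satisfies the likelihood constraint and the returned value is a feasible lower estimate of $\sigma^*$.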
Inner problem with Maximum A Posteriori Models

The inner problem (15) with MAP uncertainty models takes the form

$$\sigma^* := \max_p \ p^T v \ : \ p \ge 0,\ p^T \mathbf{1} = 1,\ \sum_j (f(j) + \alpha(j) - 1) \log p(j) \ge \kappa,$$

where $\kappa$ depends on the normalizing constant $K$ appearing in the prior density function and on the chosen lower bound on the MAP function, $\beta$. We observe that this problem has exactly the same form as in the case of the likelihood function, provided we replace $f$ by $f + \alpha - \mathbf{1}$. Therefore, the same results apply to the MAP case.

Inner problem with Entropy Models

The inner problem (15) with entropy uncertainty models takes the form

$$\sigma^* := \max_p \ p^T v \ : \ \sum_j p(j) \log \frac{p(j)}{q(j)} \le \beta,\ p^T \mathbf{1} = 1,\ p \ge 0.$$

We note that the constraint set actually equals the whole probability simplex if $\beta$ is too large, specifically if $\beta \ge \max_i (-\log q_i)$, since the latter quantity is the maximum of the relative entropy function over the simplex. Thus, if $\beta \ge \max_i (-\log q_i)$, the worst-case value of $p^T v$ for $p \in \mathcal{P}$ is equal to $v_{\max} := \max_j v(j)$.

Dual problem

By standard duality arguments (set $\mathcal{P}$ being of non-empty interior), the inner problem is equivalent to its dual:

$$\min_{\lambda > 0, \mu} \ \mu + \beta\lambda + \lambda \sum_j q(j) \exp\left( \frac{v(j) - \mu}{\lambda} - 1 \right).$$

Setting the derivative with respect to $\mu$ to zero, we obtain the optimality condition

$$\sum_j q(j) \exp\left( \frac{v(j) - \mu}{\lambda} - 1 \right) = 1,$$

from which we derive

$$\mu = \lambda \log \left( \sum_j q(j) \exp \frac{v(j)}{\lambda} \right) - \lambda.$$

The optimal distribution is

$$p^*(j) = \frac{q(j) \exp \dfrac{v(j)}{\lambda}}{\displaystyle\sum_i q(i) \exp \dfrac{v(i)}{\lambda}}.$$

As before, we reduce the problem to a one-dimensional problem:

$$\min_{\lambda > 0} \sigma(\lambda),$$

where $\sigma$ is the convex function

$$\sigma(\lambda) = \lambda \log \left( \sum_j q(j) \exp \frac{v(j)}{\lambda} \right) + \beta\lambda. \tag{25}$$

Perhaps not surprisingly, the above function is closely linked to the moment generating function of a random variable $\mathbf{v}$ having the discrete distribution with mass $q_i$ at $v_i$.

A bisection algorithm

As proved in (Nilim & El-Ghaoui 2004), the convex function $\sigma$ in (25) has the following properties:

$$\forall \lambda \ge 0, \quad q^T v + \beta\lambda \le \sigma(\lambda) \le v_{\max} + \beta\lambda, \tag{26}$$

and

$$\sigma(\lambda) = v_{\max} + (\beta + \log Q(v))\lambda + o(\lambda), \tag{27}$$

where

$$Q(v) := \sum_{j \,:\, v(j) = v_{\max}} q(j) = \mathbf{Prob}\{\mathbf{v} = v_{\max}\}.$$

Hence, $\sigma(0) = v_{\max}$ and $\sigma'(0) = \beta + \log Q(v)$. In addition, at infinity the expansion of $\sigma$ is

$$\sigma(\lambda) = q^T v + \beta\lambda + o(1). \tag{28}$$

The bisection algorithm can be started with the lower bound $\lambda_- = 0$. An upper bound can be computed by finding a solution to the equation $\sigma(0) = q^T v + \beta\lambda$, which yields the initial upper bound $\lambda_+^0 = (v_{\max} - q^T v)/\beta$. By convexity, a minimizer exists in the interval $[0, \lambda_+^0]$.

Note that if $\sigma'(0) \ge 0$, then $\lambda = 0$ is optimal and the optimal value of $\sigma$ is $v_{\max}$. This means that if $\beta$ is too high, that is, if $\beta > -\log Q(v)$, enforcing robustness amounts to disregarding any prior information on the probability distribution $p$. We have observed in (Nilim & El-Ghaoui 2004) a similar phenomenon brought about by too large values of $\beta$, which resulted in a set $\mathcal{P}$ equal to the probability simplex. Here, the limiting value $-\log Q(v)$ depends not only on $q$ but also on $v$, since we are dealing with the optimization problem (15) and not only with its feasible set $\mathcal{P}$.

Computational complexity of the inner problem

Equipped with one of the above uncertainty models, we have shown in the previous section that the inner problem can be converted by convex duality to a problem of minimizing a single-variable convex function.
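As an illustration of this single-variable reduction for the relative entropy model, the function $\sigma(\lambda)$ in (25) can be minimized by bisection on the sign of its derivative over $[0, \lambda_+^0]$. The sketch below rests on assumptions not stated in the paper: $q > 0$ and $\beta > 0$, a relative stopping tolerance `delta`, and a shift of the exponents by $v_{\max}$ for numerical safety. The derivative formula used in the code follows from differentiating (25) and is not quoted from the source. The name `entropy_inner` and the test data are hypothetical.

```python
import math


def entropy_inner(v, q, beta, delta=1e-12):
    """Approximate max p^T v over {p in simplex : D(p || q) <= beta}.

    Minimizes sigma(lam) = lam * log(sum_j q_j exp(v_j / lam)) + beta * lam
    by bisection on the sign of sigma'(lam). Assumes q > 0 and beta > 0.
    """
    v_max = max(v)
    mass_at_max = sum(qj for qj, vj in zip(q, v) if vj == v_max)   # Q(v)
    if beta + math.log(mass_at_max) >= 0:
        return v_max               # sigma'(0) >= 0: lam = 0 is optimal

    def value_and_slope(lam):
        # Shifted weights: exponents are <= 0, so exp cannot overflow.
        w = [qj * math.exp((vj - v_max) / lam) for qj, vj in zip(q, v)]
        s = sum(w)
        shifted_mean = sum(wj * (vj - v_max) for wj, vj in zip(w, v)) / s
        value = v_max + lam * (math.log(s) + beta)
        slope = math.log(s) + beta - shifted_mean / lam
        return value, slope

    lo = 0.0
    hi = (v_max - sum(qj * vj for qj, vj in zip(q, v))) / beta
    while hi - lo > delta * (1.0 + hi):
        mid = 0.5 * (lo + hi)
        if value_and_slope(mid)[1] > 0:
            hi = mid
        else:
            lo = mid
    return value_and_slope(0.5 * (lo + hi))[0]


# Two-state check: with q = (0.5, 0.5), v = (0, 1) and beta chosen as
# D((0.2, 0.8) || q), the worst-case row is (0.2, 0.8) and the value is 0.8.
beta_ent = 0.2 * math.log(0.4) + 0.8 * math.log(1.6)
sigma_ent = entropy_inner([0.0, 1.0], [0.5, 0.5], beta_ent)
```

Because $\sigma$ is flat near its minimizer, the returned value is far less sensitive to the final bracket width than the minimizer itself, in line with the earlier remark on stopping criteria.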
In turn, this one-dimensional convex optimization problem can be solved via a bisection algorithm with a worst-case complexity of $O(n \log(v_{\max}/\delta))$ (see (Nilim & El-Ghaoui 2004) for details), where $\delta > 0$ specifies the accuracy at which the optimal value of the inner problem (15) is computed, and $v_{\max}$ is a global upper bound on the value function.

Remark: We can also use models where the uncertainty in the $i$-th row for the transition matrix $P^a$ is described by a finite set of vectors, $\mathcal{P}_i^a = \{ p_i^{a,1}, \ldots, p_i^{a,K} \}$. In this case the complexity of the corresponding robust dynamic programming algorithm is increased by a relative factor of $K$ with respect to its classical counterpart, which makes the approach attractive when the number of "scenarios" $K$ is moderate.

Concluding remarks

We proposed a "robust dynamic programming" algorithm for solving finite-state and finite-action MDPs whose solutions are guaranteed to tolerate arbitrary changes of the transition probability matrices within given sets. We proposed models based on KL divergence, which is a natural way to describe estimation errors. The resulting robust dynamic programming algorithm has almost the same computational cost as the classical dynamic programming algorithm: the relative increase to compute an $\epsilon$-suboptimal policy is $O(\log(N/\epsilon))$ in the $N$-horizon case, and $O(\log(1/\epsilon))$ for the infinite-horizon case.

References

Bagnell, J.; Ng, A.; and Schneider, J. 2001. Solving uncertain Markov decision problems. Technical Report CMU-RI-TR-01-25, Robotics Institute, Carnegie Mellon University.

Boyd, S., and Vandenberghe, L. 2004. Convex Optimization. Cambridge, U.K.: Cambridge University Press.

Feinberg, E., and Shwartz, A. 2002. Handbook of Markov Decision Processes, Methods and Applications. Boston: Kluwer Academic Publishers.

Ferguson, T. 1974. Prior distributions on space of probability measures. The Annals of Statistics 2(4).

Givan, R.; Leach, S.; and Dean, T. 1997. Bounded parameter Markov decision processes. In Fourth European Conference on Planning.

Lehmann, E., and Casella, G. 1998. Theory of Point Estimation. New York, USA: Springer-Verlag.

Nilim, A., and El-Ghaoui, L. 2003. Robust solution to Markov decision problems with uncertain transition matrices: proofs and complexity analysis. In the Proceedings of the Neural Information Processing Systems (NIPS, 2004).

Nilim, A., and El-Ghaoui, L. 2004. Curse of uncertainty in Markov decision processes: how to fix it. Technical Report UCB/ERL M04/, Department of EECS, University of California, Berkeley. A related version has been submitted to Operations Research in Dec.

Puterman, M. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. New York: Wiley-Interscience.

Satia, J. K., and Lave, R. L. 1973. Markov decision processes with uncertain transition probabilities. Operations Research 21(3).

Shapiro, A., and Kleywegt, A. J. 2002. Minimax analysis of stochastic problems. Optimization Methods and Software. To appear.

White, C. C., and Eldeib, H. K. 1994. Markov decision processes with imprecise transition probabilities. Operations Research 42(4).
