Efficient Approximate Planning in Continuous Space Markovian Decision Problems
Csaba Szepesvári, Mindmaker Ltd., Konkoly-Thege M. u., Budapest, HUNGARY. szepes@mindmaker.hu

Abstract. Monte-Carlo planning algorithms for planning in continuous state-space, discounted Markovian Decision Problems (MDPs) having a smooth transition law and a finite action space are considered. We prove various polynomial complexity results for the considered algorithms, improving upon several known bounds.

Keywords: Markovian Decision Problems, planning, value iteration, Monte-Carlo algorithms.

1. Introduction

MDPs provide a clean and simple, yet fairly rich framework for studying various aspects of intelligence, such as planning. A well-known practical limitation of planning in MDPs is the so-called curse of dimensionality [], referring to the exponential rise in the resources required to compute even approximate solutions to an MDP as the size of the MDP (the number of state variables) increases. For example, conventional dynamic programming (DP) algorithms, such as value- or policy-iteration, scale exponentially with the size, even if they are combined with sophisticated multigrid algorithms [4]. Moreover, the curse of dimensionality is not specific to any algorithm, as shown by a result of Chow and Tsitsiklis [3]. Recently, Kearns et al. have shown that a certain on-line, tree-building algorithm avoids the curse of dimensionality in discounted MDPs [9]. This result has since been extended to partially observable MDPs (POMDPs) by the same authors [8]. The bounds in these two papers are independent of the size of the state space, but scale exponentially with 1/(1-γ), the effective horizon-time, where γ is the discount factor of the MDP. In this paper we consider another on-line planning algorithm that will be shown to scale polynomially with the horizon-time as well. The price of this is that we have to assume more regularity of the MDPs we consider.
In particular, we will restrict ourselves to stochastic MDPs with finite action spaces and state space X = [0,1]^d and, more importantly, assume that the transition probability kernel of the MDP is subject to the Lipschitz condition |p(x|x_1, a) - p(x|x_2, a)| <= L_p ||x_1 - x_2||_1 for any states x, x_1, x_2 ∈ [0,1]^d and action a ∈ A. Here L_p > 0 is a given fixed number and ||·||_1 denotes the l_1 norm of vectors. Another restriction (quite common in the literature) that we will assume is the uniform boundedness of the transition probabilities (the bound shall be denoted by K_p) and of the immediate rewards (bound denoted by K_r). Further, our bounds will depend on the dimension of the state space, d. The idea of the considered algorithms originates in the algorithm considered by Rust [3].(2) Rust studied a more restricted class of problems than considered in this paper and proved the following result. First, let us define the concept of ε-optimality in the mean. Fix an MDP with state space X. A random, real-valued function V̂ with domain X is called ε-optimal in the mean if E[||V̂ - V*||_∞] <= ε, where V* is the optimal value function underlying the selected MDP, ||·||_∞ is the maximum-norm, and the expectation is

(1) The bounds developed by Kearns et al. do not exhibit any dependence on the state space. (2) The algorithm will be given in the next section.

AI Communications, IOS Press. All rights reserved.
taken for the random function V̂. The input of the algorithm is a tolerance number, ε > 0. Given any ε > 0, the algorithm first builds up a (random) cache C_ε. Then, given a query state x ∈ X and the cache C_ε, the algorithm draws a sample of a random function V̂(x), V̂ being ε-optimal in the mean. Rust has shown that both phases of the algorithm are polynomial in |A|, K_r/(ε(1-γ)), L_p, L_r, d, K_p. Here L_r is the Lipschitz factor of the immediate rewards. Note that Rust's bound scales polynomially with the effective horizon-time, so our approach will be to extend his algorithm to planning.

The very first idea along this way is to make use of Markov's inequality. The algorithm based on this idea would work as follows: Fix the random sample and consider V̂ as given by Rust's algorithm, and a state x. Using Markov's inequality one gets that P(||V̂ - V*||_∞ >= δ) <= ε/δ. Now, imagine that we can compute argmax_{a∈A} { r(x, a) + γ ∫ p(y|x, a) V̂(y) dy }. A contraction argument would then show that drawing N = poly(K_r/(εδ), L_p, L_r, |A|, d, K_r, K_p, 1/(1-γ)) samples is sufficient for ensuring the ε-optimality of the resulting greedy policy π with probability at least 1-δ. Now, the |A| integrals can themselves be approximated by Monte-Carlo methods.(3) The computational complexity of the resulting algorithm will depend polynomially on N and will thus scale polynomially with L_r and 1/δ. There are a number of methods to boost the polynomial dependence on 1/δ to log(1/δ). Here, we are going to use maximal inequalities to arrive at such a result. This method will have the additional benefit that we can get rid of the Lipschitzian condition regarding the immediate rewards and boost the polynomial dependence of the complexity bounds on L_p to a poly-logarithmic one. Interestingly, our bound for the number of samples will be poly-logarithmic in the size of the action space, as well. Note, however, that the complexity bounds will still scale polynomially with the size of the action space. We will also derive novel bounds for the complexity of calculating uniformly optimal policies.

(3) One might either want to reuse the samples drawn earlier or draw new samples. The second approach is easier to analyze, whilst the first one may appear more elegant for some.

The organization of the paper is as follows: In Section 2 we provide the necessary background. The algorithm is given in Section 3, and the main result of the paper is formulated in Section 4. The proof of the main result is given in Section 5, and conclusions are drawn in Section 6.

2. Preliminaries

We assume that the reader is familiar with the basics of the theory of MDPs. Readers who lack the necessary background are referred to the book of Dynkin and Yushkevich [6] or the more recent books [2] and [].

2.1. Notation

Let p ∈ [1, +∞]. ||·||_p refers to the l_p norm of vectors and the L_p norm of functions, depending on the type of its argument. Lip(||·||_p) denotes the set of mappings that are Lipschitz-continuous in the norm ||·||_p: f ∈ Lip(||·||_p) means that there exists a positive constant L > 0 s.t. ||f(x) - f(y)||_p <= L ||x - y||_p (the domains of the mappings are suppressed). L is called the ||·||_p-Lipschitz constant of f. Lip(||·||_p; γ) ⊂ Lip(||·||_p) denotes the set of mappings whose ||·||_p-Lipschitz constant is not larger than γ. A mapping T is called a contraction in the norm ||·||_p if T ∈ Lip(||·||_p; γ) for some 0 <= γ < 1. Let V be any set, T : V → V and S : V → V. Then the mapping TS : V → V is defined by (TS)v = T(Sv), v ∈ V. The set of natural numbers will be denoted by ℕ, the set of reals by ℝ. If t ∈ ℕ then T^t denotes the map that is the product of T with itself t times. We say that T = S iff Tv = Sv holds for all v ∈ V. ω will in general denote an elementary event of the probability space under consideration, lhs means left-hand-side, and rhs means right-hand-side. We define B(X) to be the set of all bounded real-valued functions over X: B(X) = { f : X → ℝ : ||f||_∞ < +∞, f is measurable }.
Further, for any K > 0, B_K(X) shall denote the set of all bounded functions whose maximum-norm is below the constant K: B_K(X) = { f ∈ B(X) : ||f||_∞ < K }.
Table 1
Pseudo-code of the algorithm

0. Input: x ∈ X (query state), ε > 0 (tolerance), p, r, γ, A (model parameters).
1. Compute t and N as defined in Theorem 4.2.
2. Draw X_1, ..., X_N independent samples uniformly distributed over X.
3. Compute p̂_{X_{1:N}}(X_i|X_j, a) (1 <= i, j <= N) using p̂_{x_{1:N}}(x_i|x, a) = p(x_i|x, a) / Σ_{j=1}^N p(x_j|x, a) if Σ_{j=1}^N p(x_j|x, a) > 0, and let p̂_{x_{1:N}}(x_i|x, a) = 0 otherwise.
4. Let v_i = 0, 1 <= i <= N.
5. Repeat t times: v_i := max_{a∈A} { r(X_i, a) + γ Σ_{j=1}^N p̂_{X_{1:N}}(X_j|X_i, a) v_j }, 1 <= i <= N.
6. Let a = argmax_{a∈A} { r(x, a) + γ Σ_{j=1}^N p̂_{X_{1:N}}(X_j|x, a) v_j }.
7. Return a.

2.2. The Model

Let us consider the continuous space discounted MDP given by (X, A, p, r, γ), where X = [0,1]^d (d > 0, d ∈ ℕ) is the state space, A is the action space, p is a measurable transition density: p(y|x, a) >= 0 and ∫ p(y|x, a) dy = 1 for all (x, a) ∈ X × A, r : X × A → ℝ is a measurable function, called the reward function, and 0 <= γ < 1 is the discount factor. We further assume the following:

Assumption 2.1. A is finite.

Assumption 2.2. There exist constants K_p, L_p > 0 s.t. ||p||_∞ <= K_p and p(y|·, a) ∈ Lip(||·||_1; L_p) for all (y, a) ∈ X × A.

Assumption 2.3. There exists some constant K_r > 0 s.t. ||r||_∞ < K_r.

3. The Algorithm

The pseudo-code of the algorithm yielding uniformly approximately optimal policies can be seen in Table 1. Note that, at the expense of increasing the computation time, one may downscale the storage requirement of the algorithm from O(N^2) to O(N) if Step 3 of the algorithm is omitted. Then Equation (2) must be used in Steps 5 and 6. Note that one may still precompute the normalizing factors of (2) to speed up the computations, since the storage requirements for these normalizing factors depend only linearly on N. Rust's original algorithm builds up the cache C_ε = (v_1, ..., v_N) using Steps 1-5 with some N and t. Then, for any query state x ∈ X, his algorithm returns the random value V̂(x) = max_{a∈A} { r(x, a) + γ Σ_{j=1}^N p̂_{X_{1:N}}(X_j|x, a) v_j }.
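The planner of Table 1 can be sketched in a few lines of code. This is a minimal illustrative sketch, not the paper's implementation; the callables `p` (transition density), `r` (reward), and the toy model in the usage note below are assumptions introduced for the example.

```python
import numpy as np

def plan_action(x, p, r, actions, gamma, N, t, d, seed=None):
    """One call of the Table-1 planner (illustrative sketch).

    p(y, x, a) stands for the density p(y|x,a); r(x, a) is the reward.
    Both are hypothetical callables standing in for the model parameters."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(N, d))            # Step 2: uniform sample
    # Step 3: self-normalized weights p_hat[a][i, j] approximating p(X_j|X_i, a)
    p_hat = {}
    for a in actions:
        W = np.array([[p(X[j], X[i], a) for j in range(N)] for i in range(N)])
        Z = W.sum(axis=1, keepdims=True)
        p_hat[a] = np.divide(W, Z, out=np.zeros_like(W), where=Z > 0)
    v = np.zeros(N)                                   # Step 4
    for _ in range(t):                                # Step 5: t value-iteration sweeps
        v = np.max([[r(X[i], a) + gamma * p_hat[a][i] @ v for i in range(N)]
                    for a in actions], axis=0)
    # Steps 6-7: greedy action at the query state x
    q = []
    for a in actions:
        w = np.array([p(X[j], x, a) for j in range(N)])
        z = w.sum()
        q.append(r(x, a) + gamma * (w @ v) / z if z > 0 else r(x, a))
    return actions[int(np.argmax(q))]
```

For instance, with the trivial toy model p(y|x, a) ≡ 1 (uniform transitions on [0,1]) and r(x, a) = a, the planner should return the larger-reward action.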
It can be readily seen that our algorithm is just a straightforward extension of the one considered by Rust; the difficulty lies in deriving appropriate bounds for N and t. Now, we introduce the notation needed to state the main results. Let T_a : B(X) → B(X) be defined by (T_a V)(x) = r(x, a) + γ ∫ p(y|x, a) V(y) dy. Here a ∈ A is arbitrary and the integral should be understood, here and in what follows, to be over X. For a stationary policy π : X → A, let T_π : B(X) → B(X) be defined by (T_π V)(x) = (T_{π(x)} V)(x). Finally, let the Bellman-operator T : B(X) → B(X) be defined by (TV)(x) = max_{a∈A} {(T_a V)(x)}. Under our assumptions, T is known to have a unique fixed-point, V*, called the optimal-value function. V* is known to be uniformly bounded. It is also known that any (stationary) policy π : X → A satisfying T_π V* = TV* is optimal in the sense that for any given initial state the total expected discounted return resulting from the execution of π is maximal. (The execution of a policy π : X → A means the execution of action π(x) whenever the state is x.) A policy is called myopic or greedy w.r.t. the function V ∈ B(X) if T_π V = TV. Since in our case the action set A is finite, the existence of a myopic policy is guaranteed for any given uniformly bounded function V. Now let x_1, ..., x_N ∈ X be fixed elements of the state space. For brevity, let us denote the N-tuple
(x_1, ..., x_N) by x_{1:N}. Let T̂_{x_{1:N},a} : B(X) → B(X) be defined by

(T̂_{x_{1:N},a} V)(x) = r(x, a) + γ Σ_{i=1}^N p̂_{x_{1:N}}(x_i|x, a) V(x_i),   (1)

where

p̂_{x_{1:N}}(x_i|x, a) = p(x_i|x, a) / Σ_{j=1}^N p(x_j|x, a), if Σ_{j=1}^N p(x_j|x, a) > 0; and 0 otherwise.   (2)

The operator T̂_{x_{1:N},a} is obtained from T_a by approximating the integral in T_a by a finite sum. It should be clear that, because of the Lipschitz conditions on p, T̂_{x_{1:N},a} does approximate T_a, and the quality of approximation depends on the distribution of the points x_{1:N}. Using T̂_{x_{1:N},a} we introduce the operator T̂_{x_{1:N}} that is meant to approximate T. It is defined as follows: T̂_{x_{1:N}} : B(X) → B(X), and

(T̂_{x_{1:N}} V)(x) = max_{a∈A} {(T̂_{x_{1:N},a} V)(x)}.   (3)

Now, analogously with the previous definitions, T̂_{x_{1:N},π} is introduced by (T̂_{x_{1:N},π} V)(x) = (T̂_{x_{1:N},π(x)} V)(x). Throughout the paper we are going to work with independent random variables X_1, ..., X_N, uniformly distributed over X.(4) Similarly to the notation introduced for deterministic N-tuples of state space points, X_{1:N} will be used to denote (X_1, ..., X_N). We define the random operators T̂_a, T̂_π and T̂ by the respective equations T̂_a = T̂_{X_{1:N},a}, T̂_π = T̂_{X_{1:N},π}, and T̂ = T̂_{X_{1:N}}. Here T̂ is called the random Bellman-operator. A great deal of effort in this paper will be devoted to showing that T̂ and its powers approximate the true Bellman-operator T and its respective powers uniformly well, with high probability.

(4) The uniform distribution is used for simplicity only. Any other sampling distribution with support covering X could be used if the algorithm is modified appropriately (importance sampling) [7]. The form of the ideal sampling distribution is far from clear, since a single sample-set is used to estimate an infinite number of integrals; it should be the subject of future research.
In order to connect the algorithm with the operators defined so far, let us introduce the projection operator P̂_{x_{1:N}} : B(X) → ℝ^N defined by P̂_{x_{1:N}}(V) = (V(x_1), ..., V(x_N)), and the expansion operators Ê_{x_{1:N},a}, Ê_{x_{1:N}} : ℝ^N → B(X) defined by the respective equations

(Ê_{x_{1:N},a} v)(x) = r(x, a) + γ Σ_{j=1}^N p̂_{x_{1:N}}(x_j|x, a) v_j,  a ∈ A, and

(Ê_{x_{1:N}} v)(x) = max_{a∈A} {(Ê_{x_{1:N},a} v)(x)}.

Finally, let the finite state-space Bellman operator L̂_{x_{1:N}} : ℝ^N → ℝ^N be defined by

(L̂_{x_{1:N}} v)_i = max_{a∈A} { r(x_i, a) + γ Σ_{j=1}^N p̂_{x_{1:N}}(x_j|x_i, a) v_j }.

The following proposition highlights the connection between the algorithm and these operators:

Proposition 3.1. For any integer t > 0,

T̂^{t+1}_{x_{1:N}} = Ê_{x_{1:N}} L̂^t_{x_{1:N}} P̂_{x_{1:N}},

and in particular,

T̂^{t+1} = Ê_{X_{1:N}} L̂^t_{X_{1:N}} P̂_{X_{1:N}}.

Proof. By inspection.

Remark 3.2. According to Proposition 3.1, one can compute (T̂^{t+1}V)(x) in two phases, the first of which we could call the off-line phase and the second of which we could call the on-line phase. In the off-line phase one computes the N-dimensional vector v^{(t)} = L̂^t_{X_{1:N}} P̂_{X_{1:N}} V, which takes O(tN^2|A|) time, whilst in the second phase one computes the value of (T̂^{t+1}V)(x) by evaluating (Ê_{X_{1:N}} v^{(t)})(x). This second
step takes O(N^2|A|) time, and thus the whole procedure takes O(tN^2|A|) time. Further, it is easy to see that the procedure takes O(N + |A|) space.(5) Now, the algorithm whose pseudo-code was given above can be formulated as follows: Assume that we are given a fixed tolerance, ε > 0. On the basis of ε and L_p, K_r, |A|, γ we compute some integer t > 0 and another integer N > 0. Each time we need to compute an action of the randomized policy π for some state x, we draw a random sample X_{1:N} and compute v^{(t)} = L̂^t_{X_{1:N}} P̂_{X_{1:N}} V_0, where V_0(x) = 0. Then a random action π(x) is computed by evaluating

argmax_{a∈A} (Ê_{X_{1:N},a} v^{(t)})(x).   (4)

The action attaining the argmax is returned. The resulting policy will be shown to be ε-optimal. Another, computationally less expensive method is to hold the random sample X_{1:N} fixed and compute v^{(t)} only once. Then the computation of π(x) using (4) costs only O(|A|N^2) steps.

4. Results

The first result that we will prove shows that the algorithm just described at the end of the previous section yields uniformly approximately optimal policies with high probability and has polynomial complexity:

Theorem 4.1. Let K = K_r/(1-γ) and let ε > 0, δ > 0, V_0 ∈ B_K(X) be fixed. Let t = t(ε, γ, K), where

t(ε, γ, K) = ⌈ (log(8K) + log(1/(ε(1-γ)))) / log(1/γ) ⌉

and let

N >= 512 K^2 K_p^2 ( 24(K+1)/(ε(1-γ)^2) )^2 ( log 8 + log(t(ε, γ, K) + 1) + log|A| + d log( 384(K+1)^2 L_p d / (ε(1-γ)^2) + 1 ) + log(1/δ) ).

Let V = T̂^t V_0 and let the stationary policy π be defined by T̂_π V = T̂V. Then P(||V^π - V*||_∞ >= ε) <= δ. Further, the complexity of the algorithm is polynomial in d, 1/ε, K, K_p, log(L_p), |A| and 1/(1-γ).

Note that ideally the bound on N should depend only on K/ε, so that scaling the rewards would not change the complexity results. The bound given in the above theorem does not have this property: it has some K's without a corresponding ε. The cause of this will become clear during the course of the proof of this theorem and, more specifically, in the proof of Lemma 5.7.
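The horizon t(ε, γ, K) of Theorem 4.1 is simply the smallest number of value-iteration sweeps that shrinks the initial gap of order 8K below ε(1-γ), i.e. the smallest t with γ^t <= ε(1-γ)/(8K). A quick sanity check (the numbers in the usage note are hypothetical, not from the paper):

```python
import math

def horizon(eps, gamma, K):
    # t(eps, gamma, K) = ceil((log(8K) + log(1/(eps*(1-gamma)))) / log(1/gamma)),
    # i.e. the smallest t with gamma**t <= eps*(1-gamma)/(8K)
    return math.ceil((math.log(8 * K) + math.log(1 / (eps * (1 - gamma))))
                     / math.log(1 / gamma))
```

For example, horizon(0.1, 0.9, 10) evaluates to 86, and indeed 0.9**86 <= 0.1*(1-0.9)/(8*10) while 0.9**85 is not.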
Note that if ε is sufficiently small, an upper bound on the above expression can always be derived by replacing K by K/ε at those occurrences of K that lack a corresponding ε term. This way one gets a less tight, but (in some sense) better behaving bound. The next result shows that the modified, fully on-line algorithm given in Table 1 yields a uniformly approximately optimal policy and has polynomial complexity. The above comments on scaling the rewards apply to this result, too.

Theorem 4.2. Let K = K_r/(1-γ) and let ε > 0. Fix some V_0 ∈ B_K(X). Let ε_1 = ε(1-γ)/(2(2+γ)), δ_1 = ε_1(1-γ)/(4K) (= ε_1(1-γ)^2/(4K_r)). Further, let t = t(ε_1/4, γ, K) and let N be the smallest integer larger than

512 K^2 K_p^2 ( 96(K+1)/(ε_1(1-γ)^3) )^2 ( log 8 + log(t + 1) + log|A| + d log( 768(K+1)^2 L_p d / (ε_1(1-γ)^3) + 1 ) + log( 4K/(ε_1(1-γ)) ) ).

Let V = T̂^t V_0 and let the stochastic stationary policy π : X × A → [0,1] be defined by π(x, a) = P(π_{X_{1:N}}(x) = a), where π_{X_{1:N}} is the policy defined by T̂V = T̂_{π_{X_{1:N}}} V. Then π is ε-optimal and, given a state x, a random action of π can be computed in time and space polynomial in K_r/ε, d, K_r, K_p, log L_p, |A| and 1/(1-γ).

(5) Here we assume that the basic algebraic operations over reals take O(1) time and that the storage of a real number takes O(1) space. We also assume that p̂_{X_{1:N}} is not stored.
The rough outline of the proofs of these theorems is as follows: Under our assumptions, Pollard's maximal inequality (cf. [10]) ensures that for any given fixed function V_0, ||T̂V_0 - TV_0||_∞ is small with high probability.(6) Using the triangle inequality, one reduces the comparison of T̂^n V_0 and T^n V_0 to those of T̂T^k V_0 and TT^k V_0, where k varies from zero to n-1. More precisely, one shows that if the differences between T̂T^k V_0 and TT^k V_0 are small for all k = 0, ..., n-1, then ||T̂^n V_0 - T^n V_0||_∞ will be small, too. Using this result, it is then easy to prove a maximal inequality for ||T̂^n V_0 - T^n V_0||_∞. Now, one can use standard contraction arguments to prove an inequality that bounds the suboptimality of a policy that is approximately greedy w.r.t. some function V in terms of the Bellman-residuals (see e.g. [4]). The plan is to use this inequality for V = T̂^n V_0 and T̂. Some more calculations yield Theorem 4.1. Then, it is proven that if a policy selects only good actions (i.e., actions from A_ε(x) = { a ∈ A : (T_a V*)(x) >= (TV*)(x) - ε } for a suitable ε), then it is good itself (i.e., close to optimal). Next, we relax the condition of selecting good actions to selecting good actions with high probability. Such policies can be shown to be good, as well (cf. Lemma 5 of [9]). Finally, it is shown that if a policy is good with high probability then it selects good actions with high probability and thus, in turn, it must be good. This will finish the proof of Theorem 4.2. One source of the complexity of the proof stems from the fact that Pollard's inequality cannot be used in a simple way to bound ||T̂^n V_0 - T^n V_0||_∞. This is because the usual induction argument that would bound ||T̂^n V_0 - T^n V_0||_∞ based on a bound on ||T̂^{n-1} V_0 - T^{n-1} V_0||_∞ does not quite work here.
Typically, one argues that if T̂ approximates T uniformly well over the space of bounded functions (or some space of functions of interest), then ||T̂^n V_0 - T^n V_0||_∞ will be small if ||T̂^{n-1} V_0 - T^{n-1} V_0||_∞ is small. Unfortunately, the space of all bounded functions is just too rich in our case: T̂ cannot approximate T uniformly well over this rather complex space. A smaller, but still appropriate space F is needed - hence the complicated proof.

(6) We must rely on Pollard's maximal inequality instead of the simpler Chernoff bounds because the state space is continuous and the sup-norm above involves a supremum over the state space. Further, this result is derived in two steps, using an idea of Rust [3].

5. Proof

We prove the theorems in the next three subsections. First, we prove some maximal inequalities for the random Bellman-operators T̂_a. Next we show how these can be extended to powers of T̂ and, finally, we apply all these to prove the main results.

5.1. Maximal Inequalities for Random Bellman Operators

We shall need some auxiliary operators which are easier to deal with using probability theory. Let T̃_a : B(X) → B(X) and T̃ : B(X) → B(X) be defined by

(T̃_a V)(x) = r(x, a) + (γ/N) Σ_{i=1}^N p(X_i|x, a) V(X_i);

(T̃V)(x) = max_{a∈A} {(T̃_a V)(x)}.

Operator T̃_a is a simple Monte-Carlo estimate of operator T_a and will be shown to converge uniformly to T_a using standard methods. Unfortunately, T̃_a is not suitable for further analysis as it can be a non-contraction, and in order to analyze the iterations in our algorithms, the contraction property of the approximate Bellman operators will be needed. Hence the algorithms use T̂_a; in a second step T̃_a will be related to T̂_a, and the approximation results will be extended to T̂_a. We need some definitions and results from the theory of uniform deviations (cf. [10]).

Definition 5.1. Let A ⊂ ℝ^d. The set S ⊂ A is an ε-cover of A if for all t ∈ A there exists an element s of S s.t. ||t - s|| <= ε. The set of ε-covers of A will be denoted C(A; ε).
Definition 5.2. The ε-covering number of a set A is defined by N(ε, A) = min{ |S| : S ∈ C(A; ε) }. The number log N(ε, A) is called the metric entropy of A.

Let z_{1:n} = (z_1, ..., z_n) ∈ (ℝ^d)^n and let F ⊂ ℝ^{ℝ^d}. We define

F(z_{1:n}) = { (f(z_1), ..., f(z_n)) : f ∈ F } ⊂ ℝ^n.   (5)

The following theorem is due to Pollard (see [10]):

Theorem 5.3 (Pollard, 1984). Let n > 0 be an integer, ε > 0, M > 0, and let F ⊂ [0, M]^{ℝ^d} be a set of measurable functions. Let X_1, ..., X_n ∈ ℝ^d be i.i.d. random variables. Then

P( sup_{f∈F} | (1/n) Σ_{i=1}^n f(X_i) - E[f(X_1)] | > ε ) <= 8 E[ N(ε/8, F(X_{1:n})) ] e^{-nε²/(128M²)}.   (6)

An elegant proof of this theorem can be found in [5, pp. 492]. In general, some further assumptions are needed to make the above sup measurable. Measurability problems, however, are now well understood, so we shall not worry about this detail. Readers who keep worrying should take all the probability bounds except for the main result as outer/inner-probability bounds (whichever is appropriate). Note that in the final result we work with measurable sets and therefore there is no need to refer to outer/inner probability measures. Firstly, we extend this theorem to functions mapping ℝ^d into [-M, M].

Corollary 5.3.1. Let n > 0 be an integer, ε > 0, M > 0, and let F ⊂ [-M, M]^{ℝ^d} be a set of measurable functions. Let X_1, ..., X_n ∈ ℝ^d be i.i.d. random variables. Then

P( sup_{f∈F} | (1/n) Σ_{i=1}^n f(X_i) - E[f(X_1)] | > ε ) <= 8 E[ N(ε/8, F(X_{1:n})) ] e^{-nε²/(512M²)}.   (7)

Proof. Apply Theorem 5.3 to f_M = f + M.

Definition 5.4. Let d ∈ ℕ, d > 0, and let σ > 0. Let

Grid(σ) = { (2i_1σ, ..., 2i_dσ) ∈ [0,1]^d : 0 <= i_k, 1 <= k <= d, i_k <= 1/(2σ) }

and let P_σ : [0,1]^d → Grid(σ) be defined by P_σx = argmin_y { ||x - y|| : y ∈ Grid(σ) }, where ties are broken in favor of points having smaller coordinates.

Remark 5.5. ||x - P_σx||_1 <= dσ and |Grid(σ)| <= (1/(2σ) + 1)^d.

Now we can prove our first result concerning the approximation of T_a by T̃_a.

Lemma 5.6. Let K > 0, ε > 0 and δ > 0.
Further, let B_0 ⊂ B_K(X) be a finite set,

p_1(d, ε, δ, K, K_p, L_p, |B_0|, |A|, γ) = (512 K^2 K_p^2 / ε^2) ( log 8 + log|B_0| + log|A| + d log( 16 K L_p d / ε + 1 ) + log(1/δ) )   (8)

and N >= p_1(d, ε, δ, K, K_p, L_p, |B_0|, |A|, γ). Then

P( max_{V∈B_0} max_{a∈A} ||T̃_a V - T_a V||_∞ > ε ) <= δ.   (9)

Proof. We shall make use of Corollary 5.3.1. Let F(X_{1:N}) = { z(V, x, a) : V ∈ B_0, a ∈ A, x ∈ X }, where z(V, x, a) = (V(X_1)p(X_1|x, a), ..., V(X_N)p(X_N|x, a)). Easily, the components of z(V, x, a) lie in [-KK_p, KK_p]. In order to bound N(ε, F(X_{1:N})) from above, we construct an ε-cover of F(X_{1:N}). We claim that S_σ = { z(V, x, a) : V ∈ B_0, a ∈ A, x ∈ Grid(σ) } is an ε-cover of F(X_{1:N}) if σ is chosen appropriately. In order to prove this, let us pick an arbitrary element z(V, x, a) of F(X_{1:N}). Then
||z(V, x, a) - z(V, P_σx, a)||_∞ = max_{1<=i<=N} | V(X_i)p(X_i|x, a) - V(X_i)p(X_i|P_σx, a) | <= ||V||_∞ max_{1<=i<=N} | p(X_i|x, a) - p(X_i|P_σx, a) | <= K L_p d σ.

Therefore, if σ = ε/(K L_p d) then S_σ is an ε-cover of F(X_{1:N}). By Remark 5.5, N(ε, F(X_{1:N})) <= (K L_p d/(2ε) + 1)^d |B_0| |A|. By Corollary 5.3.1, if

N >= (512 K^2 K_p^2 / ε^2) ( log 8 + log|B_0| + log|A| + d log( 16 K L_p d / ε + 1 ) + log(1/δ) )

then (9) holds.

Now, we shall prove a similar result for T̂_a, using ideas from the proof of the Corollary to Theorem 3.4 of [3].

Lemma 5.7. Let K > 0, ε > 0 and δ > 0. Further, let B_0 ⊂ B_K(X) be a finite set,

p_2(d, ε, δ, K, K_p, L_p, |B_0|, |A|, γ) = 512 K^2 K_p^2 ( (K+1)/ε )^2 ( log 8 + log(|B_0| + 1) + log|A| + d log( 16(K+1)^2 L_p d / ε + 1 ) + log(1/δ) ).   (10)

If N >= p_2(d, ε, δ, K, K_p, L_p, |B_0|, |A|, γ) then

P( max_{V∈B_0} max_{a∈A} ||T̂_a V - T_a V||_∞ > ε ) <= δ.   (11)

Proof. Let us pick some V ∈ B_0. By the triangle inequality,

||T̂_a V - T_a V||_∞ <= ||T̂_a V - T̃_a V||_∞ + ||T̃_a V - T_a V||_∞.   (12)

Let

p̄_N(x, a) = (1/N) Σ_{i=1}^N p(X_i|x, a).

If p̄_N(x, a) = 0 then (T̂_a V)(x) - (T̃_a V)(x) = 0. If p̄_N(x, a) ≠ 0 then by simple algebraic manipulations we get

(T̂_a V)(x) - (T̃_a V)(x) = γ (1 - p̄_N(x, a))/p̄_N(x, a) · (1/N) Σ_{i=1}^N p(X_i|x, a) V(X_i).

Since, by assumption, ||V||_∞ <= K, we have

| (T̂_a V)(x) - (T̃_a V)(x) | <= γ K | 1 - p̄_N(x, a) |.   (13)

Let e : X → ℝ be defined by e(x) = 1 and observe that γ(1 - p̄_N(x, a)) = (T_a e)(x) - (T̃_a e)(x), and therefore by (13) we have

| (T̂_a V)(x) - (T̃_a V)(x) | <= K | (T_a e)(x) - (T̃_a e)(x) |.

Note that this inequality holds also when p̄_N(x, a) = 0. Taking the supremum over X yields ||T̂_a V - T̃_a V||_∞ <= K ||T_a e - T̃_a e||_∞. By (12) we have

||T̂_a V - T_a V||_∞ <= K ||T_a e - T̃_a e||_∞ + ||T̃_a V - T_a V||_∞ <= (K+1) max_{V∈B_0∪{e}} ||T̃_a V - T_a V||_∞.

Therefore

max_{V∈B_0} max_{a∈A} ||T̂_a V - T_a V||_∞ <= (K+1) max_{V∈B_0∪{e}} max_{a∈A} ||T̃_a V - T_a V||_∞.

Now, the statement of the lemma follows using Lemma 5.6 with the choice p_1(d, ε/(K+1), δ, K+1, K_p, L_p, |B_0|+1, |A|, γ).
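The grid of Definition 5.4 and the two claims of Remark 5.5 are easy to check numerically. The following sketch is illustrative only (a brute-force nearest-point projection standing in for P_σ); the function names are assumptions for the example.

```python
import itertools, math

def grid(sigma, d):
    # Grid(sigma): the points (2*i_1*sigma, ..., 2*i_d*sigma) in [0,1]^d
    n = math.floor(1.0 / (2.0 * sigma))          # largest i_k with 2*i_k*sigma <= 1
    axis = [2 * i * sigma for i in range(n + 1)]
    return list(itertools.product(axis, repeat=d))

def project(x, points):
    # P_sigma x: a nearest grid point; ties go to points with smaller coordinates
    # because itertools.product enumerates them first and min() keeps the first minimum
    return min(points, key=lambda g: max(abs(a - b) for a, b in zip(x, g)))
```

For σ = 0.25 and d = 2 the grid has 9 <= (1/(2σ)+1)^2 points, each coordinate of any x ∈ [0,1]^2 is within σ of its projection, and hence ||x - P_σx||_1 <= dσ.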
5.2. Maximal Inequalities for Powers of Random Bellman Operators

First we need a proposition that relates the fixed point of a contraction operator to an operator that is approximating the contraction.

Proposition 5.8. Let B be a space of bounded functions(7), and fix some V ∈ B and integer t > 0. Let T_1, T_2 : B → B be operators on B such that T_1 ∈ Lip(γ) for some 0 <= γ < 1 and

||T_1 T_2^s V - T_2 T_2^s V|| <= α,  0 <= s <= t-1   (14)

for some α > 0. Then

||T_1^t V - T_2^t V|| <= α/(1-γ).   (15)

Proof. We prove the statement by induction; namely, we prove that

||T_1^s V - T_2^s V|| <= α/(1-γ)   (16)

holds for all 0 <= s <= t. The statement is obvious for s = 0. Assume that we have already proven (16) for s-1. By the triangle inequality,

||T_1^s V - T_2^s V|| <= ||T_1 T_1^{s-1} V - T_1 T_2^{s-1} V|| + ||T_1 T_2^{s-1} V - T_2 T_2^{s-1} V||.

Since T_1 ∈ Lip(γ), the first term of the rhs can be bounded by γ ||T_1^{s-1} V - T_2^{s-1} V||, which in turn can be bounded by γα/(1-γ), by the induction hypothesis. The second term, on the other hand, can be bounded by α, by (14). Since γα/(1-γ) + α = α/(1-γ), inequality (16) holds for s as well, thus proving the proposition.

We cite the next proposition without proof, as the proof is both elementary and well known.

Proposition 5.9. Let K = K_r/(1-γ). Then the Bellman-operator T maps B_K(X) into B_K(X).

Now follows the main result of this section.

Lemma 5.10. Let t > 0 be an integer, ε > 0, δ > 0, K = K_r/(1-γ), V_0 ∈ B_K(X). Let

p_3(d, ε, δ, K, K_p, L_p, |B_0|, |A|, γ) = 512 K^2 K_p^2 ( (K+1)/(ε(1-γ)) )^2 ( log 8 + log(|B_0| + 1) + log|A| + d log( 16(K+1)^2 L_p d / (ε(1-γ)) + 1 ) + log(1/δ) ).

If N >= p_3(d, ε, δ, K, K_p, L_p, |B_0|, |A|, γ) then

P( max{ max_{a∈A} ||T̂_a T^t V_0 - T_a T^t V_0||_∞, ||T̂^t V_0 - T^t V_0||_∞ } >= ε ) <= δ.   (17)

Proof. Let V_s = T^s V_0, 0 <= s <= t, and B_0 = {V_0, V_1, ..., V_t}. By Proposition 5.9, B_0 ⊂ B_K(X). By Lemma 5.7, if N >= p_2(d, ε(1-γ), δ, K, K_p, L_p, |B_0|, |A|, γ) then

P( max_{V∈B_0} max_{a∈A} ||T̂_a V - T_a V||_∞ >= ε(1-γ) ) <= δ.

Let the elementary random event ω be such that

max_{V∈B_0} max_{a∈A} ||T̂_a(ω) V - T_a V||_∞ <= ε(1-γ).

(7) More generally, B could be any Banach-space.
If we show that

max{ max_{a∈A} ||T̂_a(ω) T^t V_0 - T_a T^t V_0||_∞, ||T̂(ω)^t V_0 - T^t V_0||_∞ } <= ε   (18)

then the proof will be finished. Obviously,

max_{a∈A} ||T̂_a(ω) T^t V_0 - T_a T^t V_0||_∞ <= ε   (19)

by the construction of B_0 and since ε(1-γ) <= ε. Now, note that
10 0 ˆT ω)v T V max ˆT a ω)v T a V holds for all V BX ). Since by the choice of ω and, max ˆT a ω)t s V 0 T a T s V 0 ε γ), 0 s t, we also have ˆT ω)t s V 0 T T s V 0 ε γ), 0 s t. Moreover, since ˆT ω) Lip γ), Proposition 5.8 can be applied with the choice B = BX ), T = ˆT ω), T 2 = T and V = V 0, yielding ˆT ω) t V 0 T t V 0 ε. This together with 9) yields 8), thus proving the theorem Proving the ε-optimality of the Algorithm First, we prove an inequality similar to that of [4], but here we use both approximate value functions and approximate operators. Lemma 5.. Let V BX ), x : X for some > 0 and let π : X A be such that Then V π V 2 γ ˆT x : πv = ˆT x : V. max T a V ˆT x : av + γ T V V ). 20) ote that since A is finite, the policy defined in the lemma exists. Proof. We compare Tπ k V and T k V since these are known to converge to V π and V, respectively. Firstly, we write the difference Tπ k V T k V in the form of a telescoping sum: k Tπ k V T k V = T i+ π V TπV i ) i= k + T π V T V ) T i+ V T i V ). i= Using the triangle inequality, the relations T, T π Lip γ), and the inequality γ k + γ k γ γ/ γ), we get T k π V T k V γ T π V V γ + T V V ) + T π V T V. Using the identity ˆT x : πv = ˆT x : V, we write T π V T V = and thus 2) T π V ˆT ) ) x : πv + ˆTx : V T V T π V T V T π V ˆT x : πv + ˆT x : V T V 2 max T a V ˆT x : av. 22) On the other hand, T π V V T π V T V + T V V, and therefore by 2), T k π V T k V 2γ γ T V V ) γ + γ + T π V T V which combined with 22) yields T k π V T k V 2 γ T V V γ + max T a V ˆT x : av ). Taking the limes superior of both sides when k yields the lemma.
Note that if T̂_{x_{1:N},a} = T_a then we recover the tight bounds of [4].(8) The next lemma exploits the fact that if V_t = T̂_{x_{1:N}}^t V_0 for some V_0 ∈ B_K(X), then the Bellman-error ||TV_t - V_t||_∞ can be related to the quality of approximation of T_a by T̂_{x_{1:N},a}.

Lemma 5.12. Let K = K_r/(1-γ), ε > 0 and let V_0 ∈ B_K(X) be fixed. Let

t = t(ε, γ, K) = ⌈ (log(8K) + log(1/(ε(1-γ)))) / log(1/γ) ⌉,

let x_{1:N} ∈ X^N, V_t = T̂_{x_{1:N}}^t V_0, and assume that

max_{a∈A} ||T_aV_t - T̂_{x_{1:N},a}V_t||_∞ <= ε(1-γ)/(4(1+γ)).   (23)

Further, let π : X → A be s.t. T̂_{x_{1:N},π}V_t = T̂_{x_{1:N}}V_t. Then π is ε-optimal, i.e., ||V^π - V*||_∞ <= ε.

Proof. We use Lemma 5.11 with V = V_t. Let us bound the Bellman-error ||TV_t - V_t||_∞ first:

||TV_t - V_t||_∞ <= ||TV_t - T̂_{x_{1:N}}V_t||_∞ + ||T̂_{x_{1:N}}V_t - V_t||_∞ <= max_{a∈A} ||T_aV_t - T̂_{x_{1:N},a}V_t||_∞ + ||T̂_{x_{1:N}}^{t+1}V_0 - T̂_{x_{1:N}}^tV_0||_∞.

Since T̂_{x_{1:N}} ∈ Lip(γ), the second term is bounded by γ^t ||T̂_{x_{1:N}}V_0 - V_0||_∞ <= γ^t ( ||T̂_{x_{1:N}}V_0||_∞ + ||V_0||_∞ ) <= 2Kγ^t, where we have used that T̂_{x_{1:N}} : B_K(X) → B_K(X) and V_0 ∈ B_K(X). Therefore, by Lemma 5.11 we have

||V^π - V*||_∞ <= (2(1+γ)/(1-γ)) max_{a∈A} ||T_aV_t - T̂_{x_{1:N},a}V_t||_∞ + 4Kγ^{t+1}/(1-γ).

Using the definition of t and (23) we get ||V^π - V*||_∞ <= ε, proving the lemma.

(8) Note that the lemma still holds if we replace the special operators T̂_{x_{1:N},a}, T̂_{x_{1:N},π} and T̂_{x_{1:N}} by operators T̂_a, T̂_π, T̂ ∈ Lip(γ) satisfying (T̂_πV)(x) = (T̂_{π(x)}V)(x) and (T̂V)(x) = max_{a∈A} (T̂_aV)(x).

Now, we are in a position to prove the first main result, stated as Theorem 4.1 before:

Theorem 5.13. Let K = K_r/(1-γ) and let ε > 0, δ > 0, V_0 ∈ B_K(X) be fixed. Let t = t(ε, γ, K) and

p_4(d, ε, δ, K, K_p, L_p, |A|, γ) = 512 K^2 K_p^2 ( 24(K+1)/(ε(1-γ)^2) )^2 ( log 8 + log(t(ε, γ, K) + 1) + log|A| + d log( 384(K+1)^2 L_p d / (ε(1-γ)^2) + 1 ) + log(1/δ) ).   (24)

Let N >= p_4(d, ε, δ, K, K_p, L_p, |A|, γ). Let V = T̂^tV_0 and let the stationary policy π be defined by T̂_πV = T̂V. Then

P( ||V^π - V*||_∞ >= ε ) <= δ.   (25)

Proof. The proof combines Lemmas 5.12 and 5.10. Firstly, we bound m = max_{a∈A} ||T̂_aT̂^tV_0 - T_aT̂^tV_0||_∞. Let π : X → A be defined by

π(x) = argmax_{a∈A} ||T̂_aT̂^tV_0 - T_aT̂^tV_0||_∞

(π does not depend on x). Then
m = ||T̂_πT̂^tV_0 - T_πT̂^tV_0||_∞ <= ||T̂_πT̂^tV_0 - T̂_πT^tV_0||_∞ + ||T̂_πT^tV_0 - T_πT^tV_0||_∞ + ||T_πT^tV_0 - T_πT̂^tV_0||_∞ <= 2γ ||T̂^tV_0 - T^tV_0||_∞ + max_{a∈A} ||T̂_aT^tV_0 - T_aT^tV_0||_∞ <= (2γ+1) max{ max_{a∈A} ||T̂_aT^tV_0 - T_aT^tV_0||_∞, ||T̂^tV_0 - T^tV_0||_∞ }.

Therefore, if

N >= p_3(d, ε(1-γ)/(4(2γ+1)(γ+1)), δ, K, K_p, L_p, t(ε, γ, K), |A|, γ)

then by Lemma 5.10 and Lemma 5.12, ||V^π - V*||_∞ <= ε with probability at least 1-δ.

In order to finish the proof of the main theorem, we will prove that in discounted problems stochastic policies that generate ε-optimal actions with high probability are uniformly good. This result appears in the context of finite models in [9]. For completeness, we present the proof here. We start with the definition of ε-optimal actions and then prove three simple lemmas.

Definition 5.14. Let ε > 0, and consider a discounted MDP (X, A, p, r, γ). We call

A_ε(x) = { a ∈ A : (T_aV*)(x) >= (TV*)(x) - ε }

the set of ε-optimal actions. Elements of this set are called ε-optimal.

Lemma 5.15. Let π : X × A → [0,1] be a stationary stochastic policy that selects only ε-optimal actions: for all x ∈ X and a ∈ A, π(x, a) > 0 implies a ∈ A_ε(x). Then ||V^π - V*||_∞ <= ε/(1-γ).

Proof. From the definition of π it is immediate that ||T_πV* - V*||_∞ <= ε. Indeed, T_πV* <= V* and

(T_πV*)(x) = Σ_{a∈A} π(x, a)(T_aV*)(x) = Σ_{a∈A_ε(x)} π(x, a)(T_aV*)(x) >= Σ_{a∈A_ε(x)} π(x, a)((TV*)(x) - ε) = V*(x) - ε.

Now, consider the telescoping sum

T_π^kV* = T_πV* + Σ_{i=1}^{k-1} (T_π^{i+1}V* - T_π^iV*).

Therefore,

||T_π^kV* - V*||_∞ <= ||T_πV* - V*||_∞ + Σ_{i=1}^{k-1} ||T_π^{i+1}V* - T_π^iV*||_∞ <= ε + (γ/(1-γ)) ε = ε/(1-γ).

Letting k → ∞ yields ||V^π - V*||_∞ <= ε/(1-γ).

The next lemma will be applied to show that if two policies are close to each other then so are their evaluation functions. Both the lemma and its proof are very similar to those of Proposition 5.8.

Lemma 5.16. Let B be a space of bounded functions(9), and B_K = { V ∈ B : ||V|| <= K }. Assume that T_1, T_2 : B_K → B_K are such that for some α > 0, ||T_1V - T_2V|| <= α holds for all V ∈ B_K, and T_1 ∈ Lip(γ) for some 0 <= γ < 1. Then ||T_1^sV - T_2^sV|| <= α/(1-γ). Further, let V_1* be the fixed point of T_1 and V_2* the fixed point of T_2. If T_2 ∈ Lip(γ) then ||V_1* - V_2*|| <= α/(1-γ).

Proof.
The proof is almost identical to that of Proposition 5.8. One proves by induction that ‖T_1^s V − T_2^s V‖ ≤ α/(1−γ) holds for all s ≥ 0; here V ∈ B_K is fixed. Indeed, the inequality holds for s = 0. Assuming that it holds for s−1 with s ≥ 1, one gets

⁹ Again, B could be any Banach space.
    ‖T_1^s V − T_2^s V‖ ≤ ‖T_1 T_1^{s−1} V − T_1 T_2^{s−1} V‖ + ‖T_1 T_2^{s−1} V − T_2 T_2^{s−1} V‖ ≤ γα/(1−γ) + α = α/(1−γ),

showing the first part of the statement. The second part is proven by taking the limes superior of both sides as s → ∞.

Now we are ready to prove the lemma showing that policies that choose ε-optimal actions with high probability are uniformly good.

Lemma 5.7. Let ε > 0 and 1 > δ > 0 be given. Let π : X × A → [0,1] be a stochastic policy that selects ε-optimal actions with probability at least 1−δ. Then ‖V^π − V*‖ ≤ (ε + 2Kδ)/(1−γ).

Proof. Let δ(x) = Σ_{a ∉ A_ε(x)} π(x,a) denote the probability of selecting non-ε-optimal actions in state x (x ∈ X). By assumption, δ(x) ≤ δ < 1. Let π′ : X × A → [0,1] be the policy defined by

    π′(x,a) = π(x,a)/(1−δ(x)), if a ∈ A_ε(x),
    π′(x,a) = 0,               otherwise.

We claim that T_π and T_π′ are close to each other. For, let V ∈ B_K(X), where K = K_r/(1−γ). Then

    (T_π V)(x) − (T_π′ V)(x) = Σ_a (π(x,a) − π′(x,a)) (T_a V)(x),

and since ‖T_a V‖ ≤ K,

    ‖T_π V − T_π′ V‖ ≤ K sup_x Σ_a |π(x,a) − π′(x,a)|.

Further,

    Σ_a |π(x,a) − π′(x,a)| = Σ_{a ∈ A_ε(x)} |π(x,a) − π′(x,a)| + Σ_{a ∉ A_ε(x)} π(x,a)
                           = Σ_{a ∈ A_ε(x)} π(x,a) δ(x)/(1−δ(x)) + δ(x) = 2δ(x) ≤ 2δ.

Therefore, ‖T_π V − T_π′ V‖ ≤ 2Kδ. Since T_π, T_π′ and B_K(X) satisfy the assumptions of Lemma 5.6, and the fixed points of T_π and T_π′ are V^π and V^π′, respectively, we have

    ‖V^π − V^π′‖ ≤ 2Kδ/(1−γ).    (26)

Further, by construction π′ selects only ε-optimal actions, and thus by Lemma 5.5, ‖V^π′ − V*‖ ≤ ε/(1−γ). Combining this with (26), we get ‖V^π − V*‖ ≤ (ε + 2Kδ)/(1−γ), finishing the proof.

We are now ready to prove the main result of the paper, stated earlier as Theorem 4.2:

Theorem 5.8. Let K = K_r/(1−γ) and let ε > 0. Fix some V_0 ∈ B_K(X). Let ε₁ = ε(1−γ)/(2(1+γ)) and δ₁ = ε(1−γ)/(4K) (= ε(1−γ)²/(4K_r)). Further, let

    t = t(ε₁, γ, K) = ⌈log(32K/(ε(1−γ)²)) / log(1/γ)⌉    (27)

and let

    p_5(d, ε, K, K_p, L_p, |A|, γ) = 512 K² K_p² (96(K+1)/(ε(1−γ)³))² (log 8 + log(t+1) + log|A| + d log(768(K+1)² L_p/(ε(1−γ)³) + 1) + log(4K/(ε(1−γ)))).    (28)

Choose

    N ≥ p_5(d, ε, K, K_p, L_p, |A|, γ)    (29)

and let V = T̂^t_{X_{1:N}} V_0.
Further, let the stochastic stationary policy π : X × A → [0,1] be defined by

    π(x,a) = P(π_{X_{1:N}}(x) = a),    (30)

where π_{X_{1:N}} is the policy defined by T̂^{π_{X_{1:N}}}_{X_{1:N}} V = T̂_{X_{1:N}} V. Then π is ε-optimal and, given a state x, a random action of π can be computed in time and space polynomial in 1/ε, d, K, log L_p, |A| and 1/(1−γ).
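Before turning to the proof, the planning scheme behind Theorem 5.8 can be sketched in code: draw the base points X_{1:N}, run t steps of value iteration with the row-normalized sampled Bellman operator T̂_{X_{1:N}}, and act greedily at the query state. This is only an illustrative sketch, not the paper's implementation; the `reward`, `density`, and `sample_states` callables below are hypothetical stand-ins for the MDP's r and p.

```python
import numpy as np

def plan_and_act(x, sample_states, reward, density, actions, gamma, t, rng):
    """One query of the Monte-Carlo planner: draw base points X_{1:N},
    run t steps of value iteration with the sampled (normalized) Bellman
    operator, then return an action that is greedy at the query state x.
    `reward`, `density`, and `sample_states` are hypothetical stand-ins."""
    X = sample_states(rng)                      # base points X_1, ..., X_N
    R = {a: np.array([reward(xi, a) for xi in X]) for a in actions}
    W = {}                                      # row-normalized sampled kernels
    for a in actions:
        P = np.array([[density(xj, xi, a) for xj in X] for xi in X])
        W[a] = P / P.sum(axis=1, keepdims=True)
    V = np.zeros(len(X))
    for _ in range(t):                          # V <- (T-hat) V
        V = np.max([R[a] + gamma * W[a] @ V for a in actions], axis=0)
    # Greedy action at the query state, reusing the same sample set.
    def q(a):
        w = np.array([density(xj, x, a) for xj in X])
        return reward(x, a) + gamma * (w / w.sum()) @ V
    return max(actions, key=q)
```

With |A| actions and N base points, one backup in this sketch costs O(N²|A|) operations, since every base point is backed up through every other base point for every action.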
Proof. The second part of the statement is immediate (cf. Remark 3.2): the bound on the computation time is

    O((t+1) N² |A|)    (31)

and the space requirement of the algorithm is¹⁰

    O(N(1 + |A|)).    (32)

¹⁰ Assuming that only the normalization factors of the transition probabilities p̂_{X_{1:N}} are stored.

For the first part, fix X_{1:N}. By Theorem 5.3, V = T̂^t_{X_{1:N}} V_0 satisfies

    P(‖V − V*‖ > ε₁) ≤ δ₁.

We claim that if ω is such that ‖V(ω) − V*‖ ≤ ε₁ then π_{X_{1:N}}(ω)(x) ∈ A_{(1+γ)ε₁}(x). Let us pick such an ω and let T̃ = T_{π_{X_{1:N}}(ω)}; note that V = T̃ V. Then

    ‖T̃ V* − V*‖ ≤ ‖T̃ V* − T̃ V‖ + ‖T̃ V − V*‖ ≤ γ ‖V* − V‖ + ‖V − V*‖ ≤ (1+γ) ε₁.

Therefore, using the definition of A_ε(x), we get that π_{X_{1:N}}(ω)(x) ∈ A_{(1+γ)ε₁}(x). This shows that

    P(π_{X_{1:N}}(x) ∈ A_{(1+γ)ε₁}(x)) ≥ 1 − δ₁.

Now, by Lemma 5.7, the policy π defined by (30) is ((1+γ)ε₁ + 2Kδ₁)/(1−γ)-optimal, i.e.,

    ‖V^π − V*‖ ≤ ((1+γ)ε₁ + 2Kδ₁)/(1−γ).

Substituting the definitions of ε₁ and δ₁ yields the result.

6. Conclusions and Further Work

In this article we have considered an on-line planning algorithm that was shown to avoid the curse of dimensionality. Bounds following from Rust's original result by Markov's inequality were improved upon in several ways: our bounds depend poly-logarithmically on the Lipschitz constant of the transition probabilities, they do not depend on the Lipschitz constant of the immediate rewards (we dropped the assumption of Lipschitz-continuous immediate reward functions), and the number of samples depends on the cardinality of the action set only poly-logarithmically as well.

It is interesting to note that although our bounds depend poly-logarithmically on the Lipschitz constant of the transition probabilities (characterizing how fast the dynamics is), they depend polynomially on the bound of the transition probabilities (characterizing the randomness of the MDP). Therefore, perhaps not surprisingly, for this kind of Monte-Carlo algorithm faster dynamics are easier to cope with than less random dynamics (with peaky transition probability functions). As a consequence of our result, many interesting questions arise.
For example, different variants of the proposed algorithm could be compared, such as multigrid versions, versions using quasi-random numbers, or versions that use importance sampling. In practice, one would probably choose not to recompute the cache C_ε for each query. Also, in practice, one would probably precompute the transition probability table p̂_{X_{1:N}}(X_i | X_j, a) and, in order to speed up the iterations, eliminate the computation with those transition probability values that are very close to zero. This would considerably speed up the computations, as one would expect distant parts of the state space to be uncoupled. However, the theoretical effect of these modifications needs to be explored.

Note that the Lipschitz condition on p can be replaced by an appropriate condition on the metric entropy of p(·|x,a) and the proofs will still go through. Therefore, the proofs can be extended to Hölder classes of transition laws or local Lipschitz classes (e.g. ‖p(·|x₁,a) − p(·|x₂,a)‖ ≤ L(x₁,a) ‖x₁ − x₂‖; in this case one would need to use bracketing numbers), smooth functions, Sobolev classes, etc.

One of the most interesting problems is to extend the results to infinite action spaces. For sure, such an extension needs some regularity assumptions on the dependence of the transition probability law and the reward function on the actions. It would also be interesting to prove analogous results for discrete MDPs having a factorized representation. The presented algorithm may find applications in economic problems without any modifications [12]. We also work on applications to deterministic continuous state-space, finite-action-space control problems and to partially observable MDPs over discrete spaces. Also, a combination with look-ahead search can be interesting from the practical point of view.

The algorithm considered in the article was tried in practice on some standard problems (car-on-the-hill, acrobot) and it was observed to yield reasonable performance even when the number of samples was kept quite small (in the range of a few hundred to a few thousand samples). It was also observed that boundary effects can interfere negatively with the algorithm. Details of these experiments, however, will be described elsewhere.

References

[1] R. Bellman. Dynamic Programming. Princeton University Press, Princeton, New Jersey, 1957.
[2] D. P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, Englewood Cliffs, NJ, USA, 1989.
[3] C. S. Chow and J. N. Tsitsiklis. The complexity of dynamic programming. Journal of Complexity, 5:466-488, 1989.
[4] C. S. Chow and J. N. Tsitsiklis. An optimal multigrid algorithm for continuous state discrete time stochastic control. IEEE Transactions on Automatic Control, 36(8):898-914, 1991.
[5] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Applications of Mathematics: Stochastic Modelling and Applied Probability. Springer-Verlag, New York, 1996.
[6] E. B. Dynkin and A. A. Yushkevich. Controlled Markov Processes. Springer-Verlag, Berlin, 1979.
[7] G. S. Fishman. Monte Carlo: Concepts, Algorithms, and Applications. Springer-Verlag, 1999.
[8] M. Kearns, Y. Mansour, and A. Y. Ng. Approximate planning in large POMDPs via reusable trajectories. In S. A. Solla, T. K. Leen, and K. R. Müller, editors, Advances in Neural Information Processing Systems 12. MIT Press, Cambridge, MA, 1999. To appear.
[9] M. Kearns, Y. Mansour, and A. Y. Ng. A sparse sampling algorithm for near-optimal planning in large Markovian decision processes. In Proceedings of IJCAI 99, 1999.
[10] D. Pollard. Convergence of Stochastic Processes. Springer-Verlag, New York, 1984.
[11] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, 1994.
[12] J. Rust. Structural estimation of Markov decision processes. In Handbook of Econometrics, volume 4, chapter 51. North-Holland, 1994.
[13] J. Rust. Using randomization to break the curse of dimensionality. Econometrica, 65:487-516, 1996.
[14] R. J. Williams and L. C. Baird, III. Tight performance bounds on greedy policies based on imperfect value functions. In Proceedings of the Tenth Yale Workshop on Adaptive and Learning Systems, 1994.
An Analysis of Model-Based Interval Estimation for Markov Decision Processes Alexander L. Strehl, Michael L. Littman astrehl@gmail.com, mlittman@cs.rutgers.edu Computer Science Dept. Rutgers University
More informationOptimal Control. McGill COMP 765 Oct 3 rd, 2017
Optimal Control McGill COMP 765 Oct 3 rd, 2017 Classical Control Quiz Question 1: Can a PID controller be used to balance an inverted pendulum: A) That starts upright? B) That must be swung-up (perhaps
More informationStochastic Safest and Shortest Path Problems
Stochastic Safest and Shortest Path Problems Florent Teichteil-Königsbuch AAAI-12, Toronto, Canada July 24-26, 2012 Path optimization under probabilistic uncertainties Problems coming to searching for
More informationOn the static assignment to parallel servers
On the static assignment to parallel servers Ger Koole Vrije Universiteit Faculty of Mathematics and Computer Science De Boelelaan 1081a, 1081 HV Amsterdam The Netherlands Email: koole@cs.vu.nl, Url: www.cs.vu.nl/
More informationSample width for multi-category classifiers
R u t c o r Research R e p o r t Sample width for multi-category classifiers Martin Anthony a Joel Ratsaby b RRR 29-2012, November 2012 RUTCOR Rutgers Center for Operations Research Rutgers University
More informationOn Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:
A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition
More informationOn Finding Optimal Policies for Markovian Decision Processes Using Simulation
On Finding Optimal Policies for Markovian Decision Processes Using Simulation Apostolos N. Burnetas Case Western Reserve University Michael N. Katehakis Rutgers University February 1995 Abstract A simulation
More informationMarkov Decision Processes and Dynamic Programming
Master MVA: Reinforcement Learning Lecture: 2 Markov Decision Processes and Dnamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of
More information6 Reinforcement Learning
6 Reinforcement Learning As discussed above, a basic form of supervised learning is function approximation, relating input vectors to output vectors, or, more generally, finding density functions p(y,
More informationHilbert spaces. 1. Cauchy-Schwarz-Bunyakowsky inequality
(October 29, 2016) Hilbert spaces Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/fun/notes 2016-17/03 hsp.pdf] Hilbert spaces are
More information