Efficient Approximate Planning in Continuous Space Markovian Decision Problems
Csaba Szepesvári, Mindmaker Ltd., Konkoly-Thege M. u., Budapest, HUNGARY. szepes@mindmaker.hu

Abstract. Monte-Carlo planning algorithms for planning in continuous state-space, discounted Markovian Decision Problems (MDPs) having a smooth transition law and a finite action space are considered. We prove various polynomial complexity results for the considered algorithms, improving upon several known bounds.

Keywords: Markovian Decision Problems, planning, value iteration, Monte-Carlo algorithms.

1. Introduction

MDPs provide a clean and simple, yet fairly rich framework for studying various aspects of intelligence, such as planning. A well-known practical limitation of planning in MDPs is the so-called curse of dimensionality [], referring to the exponential rise in the resources required to compute even approximate solutions to an MDP as the size of the MDP (the number of state variables) increases. For example, conventional dynamic programming (DP) algorithms, such as value- or policy-iteration, scale exponentially with the size, even if they are combined with sophisticated multigrid algorithms [4]. Moreover, the curse of dimensionality is not specific to any algorithm, as shown by a result of Chow and Tsitsiklis [3]. Recently, Kearns et al. have shown that a certain on-line, tree-building algorithm avoids the curse of dimensionality in discounted MDPs [9]. This result has since been extended to partially observable MDPs (POMDPs) by the same authors [8]. The bounds in these two papers are independent of the size of the state space, but scale exponentially with 1/(1-γ), the effective horizon-time, where γ is the discount factor of the MDP. In this paper we consider another on-line planning algorithm that will be shown to scale polynomially with the horizon-time as well. The price of this is that we have to assume more regularity of the MDPs we consider.
In particular, we will restrict ourselves to stochastic MDPs with finite action spaces and state space X = [0,1]^d and, more importantly, assume that the transition probability kernel of the MDP is subject to the Lipschitz condition |p(x|x_1, a) - p(x|x_2, a)| <= L_p ||x_1 - x_2||_1 for any states x, x_1, x_2 ∈ [0,1]^d and action a ∈ A. Here L_p > 0 is a given fixed number and ||·||_1 denotes the l_1 norm of vectors. Another restriction (quite common in the literature) that we will assume is the uniform boundedness of the transition probabilities (the bound shall be denoted by K_p) and of the immediate rewards (bound denoted by K_r). Further, our bounds will depend on the dimension of the state space, d. The idea of the considered algorithms originates in the algorithm considered by Rust [3].(2) Rust studied a more restricted class of problems than considered in this paper and proved the following result. First, let us define the concept of ε-optimality in the mean. Fix an MDP with state space X. A random, real-valued function V̂ with domain X is called ε-optimal in the mean if E[||V̂ - V*||_∞] <= ε, where V* is the optimal value function underlying the selected MDP, ||·||_∞ is the maximum-norm, and the expectation is

(1) The bounds developed by Kearns et al. do not exhibit any dependence on the state space. (2) The algorithm will be given in the next section.

AI Communications, IOS Press. All rights reserved.
taken for the random function V̂. The input of the algorithm is a tolerance number, ε > 0. Given any ε > 0, the algorithm first builds up a (random) cache C_ε. Then, given a query state x ∈ X and the cache C_ε, the algorithm draws a sample of a random function V̂(x), V̂ being ε-optimal in the mean. Rust has shown that both phases of the algorithm are polynomial in |A|, K_r/(ε(1-γ)), L_p, L_r, d, K_p. Here L_r is the Lipschitz factor of the immediate rewards. Note that Rust's bound scales polynomially with the effective horizon-time, so our approach will be to extend his algorithm to planning.

The very first idea along this way is to make use of Markov's inequality. The algorithm based on this idea would work as follows: Fix the random sample and consider V̂ as given by Rust's algorithm, and a state x. Using Markov's inequality one gets that P(||V̂ - V*||_∞ >= δ) <= ε/δ. Now, imagine that we can compute argmax_{a∈A} { r(x, a) + γ ∫ p(y|x, a) V̂(y) dy }. A contraction argument would then show that drawing N = poly(K_r/(εδ), L_p, L_r, |A|, d, K_r, K_p, 1/(1-γ)) samples is sufficient for ensuring the ε-optimality of the resulting greedy policy π with probability at least 1-δ. Now, the |A| integrals can themselves be approximated by Monte-Carlo methods.(3) The computational complexity of the resulting algorithm will depend polynomially on N and will thus scale polynomially with L_r and 1/δ. There are a number of methods to boost the polynomial dependence on 1/δ to log(1/δ). Here, we are going to use maximal inequalities to arrive at such a result. This method will have the additional benefit that we can get rid of the Lipschitzian condition regarding the immediate rewards and boost the polynomial dependence of the complexity bounds on L_p to a poly-logarithmic one. Interestingly, our bound for the number of samples will be poly-logarithmic in the size of the action space, as well. Note, however, that the complexity bounds will still scale polynomially with the size of the action space. We will also derive novel bounds for the complexity of calculating uniformly optimal policies.

(3) One might either want to reuse the samples drawn earlier or draw new samples. The second approach is easier to analyze, whilst the first one may appear more elegant for some.

The organization of the paper is as follows: In Section 2 we provide the necessary background. The algorithm is given in Section 3, and the main result of the paper is formulated in Section 4. The proof of the main result is given in Section 5, and conclusions are drawn in Section 6.

2. Preliminaries

We assume that the reader is familiar with the basics of the theory of MDPs. Readers who lack the necessary background are referred to the book of Dynkin and Yushkevich [6] or the more recent books [2] and [].

2.1. Notation

Let p ∈ [1, +∞]. ||·||_p refers to the l_p norm of vectors and the L_p norm of functions, depending on the type of its argument. Lip(||·||_p) denotes the set of mappings that are Lipschitz-continuous in the norm ||·||_p: f ∈ Lip(||·||_p) means that there exists a positive constant L > 0 s.t. ||f(x) - f(y)||_p <= L ||x - y||_p (the domains of the mappings are suppressed). L is called the ||·||_p-Lipschitz constant of f. Lip(||·||_p; γ) ⊂ Lip(||·||_p) denotes the set of mappings whose ||·||_p-Lipschitz constant is not larger than γ. A mapping T is called a contraction in the norm ||·||_p if T ∈ Lip(||·||_p; γ) for some 0 <= γ < 1. Let V be any set, T : V → V and S : V → V. Then the mapping TS : V → V is defined by (TS)v = T(Sv), v ∈ V. The set of natural numbers will be denoted by ℕ, the set of reals by ℝ. If t ∈ ℕ then T^t denotes the map that is the product of T with itself t times. We say that T = S iff Tv = Sv holds for all v ∈ V. ω will in general denote an elementary event of the probability space under consideration, lhs means left-hand-side, and rhs means right-hand-side. We define B(X) to be the set of all bounded real-valued functions over X: B(X) = { f : X → ℝ : ||f||_∞ < +∞, f is measurable }.
Further, for any K > 0, B_K(X) shall denote the set of all bounded functions whose maximum-norm is below the constant K: B_K(X) = { f ∈ B(X) : ||f||_∞ < K }.
Table 1
Pseudo-code of the algorithm

0. Input: x ∈ X (query state), ε > 0 (tolerance), p, r, γ, A (model parameters).
1. Compute t and N as defined in Theorem 4.2.
2. Draw X_1, ..., X_N independent samples uniformly distributed over X.
3. Compute p̂_{X_{1:N}}(X_i|X_j, a) (1 <= i, j <= N) using p̂_{x_{1:N}}(x_i|x, a) = p(x_i|x, a) / Σ_{j=1}^N p(x_j|x, a) if Σ_{j=1}^N p(x_j|x, a) > 0, and let p̂_{x_{1:N}}(x_i|x, a) = 0 otherwise.
4. Let v_i = 0, 1 <= i <= N.
5. Repeat t times: v_i := max_{a∈A} { r(X_i, a) + γ Σ_{j=1}^N p̂_{X_{1:N}}(X_j|X_i, a) v_j }, 1 <= i <= N.
6. Let a = argmax_{a∈A} { r(x, a) + γ Σ_{j=1}^N p̂_{X_{1:N}}(X_j|x, a) v_j }.
7. Return a.

2.2. The Model

Let us consider the continuous space discounted MDP given by (X, A, p, r, γ), where X = [0,1]^d (d > 0, d ∈ ℕ) is the state space, A is the action space, p is a measurable transition density: p(y|x, a) >= 0 and ∫ p(y|x, a) dy = 1 for all (x, a) ∈ X × A, r : X × A → ℝ is a measurable function, called the reward function, and 0 <= γ < 1 is the discount factor. We further assume the following:

Assumption 2.1. A is finite.

Assumption 2.2. There exist constants K_p, L_p > 0 s.t. ||p||_∞ <= K_p and p(y|·, a) ∈ Lip(||·||_1; L_p) for all (y, a) ∈ X × A.

Assumption 2.3. There exists some constant K_r > 0 s.t. ||r||_∞ < K_r.

3. The Algorithm

The pseudo-code of the algorithm yielding uniformly approximately optimal policies can be seen in Table 1. Note that, at the expense of increasing the computation time, one may downscale the storage requirement of the algorithm from O(N^2) to O(N) if Step 3 of the algorithm is omitted. Then Equation (2) must be used in Steps 5 and 6. Note that one may still precompute the normalizing factors of (2) to speed up the computations, since the storage requirements for these normalizing factors depend only linearly on N. Rust's original algorithm builds up the cache C_ε = (v_1, ..., v_N) using Steps 1-5 with some N and t. Then, for any query state x ∈ X, his algorithm returns the random value V̂(x) = max_{a∈A} { r(x, a) + γ Σ_{j=1}^N p̂_{X_{1:N}}(X_j|x, a) v_j }.
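The planner of Table 1 can be sketched in a few lines of code. This is a minimal illustrative sketch, not the paper's implementation; the callables `p` (transition density), `r` (reward), and the toy model in the usage note below are assumptions introduced for the example.

```python
import numpy as np

def plan_action(x, p, r, actions, gamma, N, t, d, seed=None):
    """One call of the Table-1 planner (illustrative sketch).

    p(y, x, a) stands for the density p(y|x,a); r(x, a) is the reward.
    Both are hypothetical callables standing in for the model parameters."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(N, d))            # Step 2: uniform sample
    # Step 3: self-normalized weights p_hat[a][i, j] approximating p(X_j|X_i, a)
    p_hat = {}
    for a in actions:
        W = np.array([[p(X[j], X[i], a) for j in range(N)] for i in range(N)])
        Z = W.sum(axis=1, keepdims=True)
        p_hat[a] = np.divide(W, Z, out=np.zeros_like(W), where=Z > 0)
    v = np.zeros(N)                                   # Step 4
    for _ in range(t):                                # Step 5: t value-iteration sweeps
        v = np.max([[r(X[i], a) + gamma * p_hat[a][i] @ v for i in range(N)]
                    for a in actions], axis=0)
    # Steps 6-7: greedy action at the query state x
    q = []
    for a in actions:
        w = np.array([p(X[j], x, a) for j in range(N)])
        z = w.sum()
        q.append(r(x, a) + gamma * (w @ v) / z if z > 0 else r(x, a))
    return actions[int(np.argmax(q))]
```

For instance, with the trivial toy model p(y|x, a) ≡ 1 (uniform transitions on [0,1]) and r(x, a) = a, the planner should return the larger-reward action.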
It can be readily seen that our algorithm is just a straightforward extension of the one considered by Rust; the difficulty lies in deriving appropriate bounds for N and t. Now, we introduce the notation needed to state the main results. Let T_a : B(X) → B(X) be defined by (T_a V)(x) = r(x, a) + γ ∫ p(y|x, a) V(y) dy. Here a ∈ A is arbitrary and the integral should be understood, here and in what follows, to be over X. For a stationary policy π : X → A, let T_π : B(X) → B(X) be defined by (T_π V)(x) = (T_{π(x)} V)(x). Finally, let the Bellman-operator T : B(X) → B(X) be defined by (TV)(x) = max_{a∈A} {(T_a V)(x)}. Under our assumptions, T is known to have a unique fixed-point, V*, called the optimal-value function. V* is known to be uniformly bounded. It is also known that any (stationary) policy π : X → A satisfying T_π V* = TV* is optimal in the sense that for any given initial state the total expected discounted return resulting from the execution of π is maximal. (The execution of a policy π : X → A means the execution of action π(x) whenever the state is x.) A policy is called myopic or greedy w.r.t. the function V ∈ B(X) if T_π V = TV. Since in our case the action set A is finite, the existence of a myopic policy is guaranteed for any given uniformly bounded function V. Now let x_1, ..., x_N ∈ X be fixed elements of the state space. For brevity, let us denote the N-tuple
(x_1, ..., x_N) by x_{1:N}. Let T̂_{x_{1:N},a} : B(X) → B(X) be defined by

(T̂_{x_{1:N},a} V)(x) = r(x, a) + γ Σ_{i=1}^N p̂_{x_{1:N}}(x_i|x, a) V(x_i),   (1)

where

p̂_{x_{1:N}}(x_i|x, a) = p(x_i|x, a) / Σ_{j=1}^N p(x_j|x, a), if Σ_{j=1}^N p(x_j|x, a) > 0; and 0 otherwise.   (2)

The operator T̂_{x_{1:N},a} is obtained from T_a by approximating the integral in T_a by a finite sum. It should be clear that, because of the Lipschitz conditions on p, T̂_{x_{1:N},a} does approximate T_a, and the quality of approximation depends on the distribution of the points x_{1:N}. Using T̂_{x_{1:N},a} we introduce the operator T̂_{x_{1:N}} that is meant to approximate T. It is defined as follows: T̂_{x_{1:N}} : B(X) → B(X), and

(T̂_{x_{1:N}} V)(x) = max_{a∈A} {(T̂_{x_{1:N},a} V)(x)}.   (3)

Now, analogously with the previous definitions, T̂_{x_{1:N},π} is introduced by (T̂_{x_{1:N},π} V)(x) = (T̂_{x_{1:N},π(x)} V)(x). Throughout the paper we are going to work with independent random variables X_1, ..., X_N, uniformly distributed over X.(4) Similarly to the notation introduced for deterministic N-tuples of state space points, X_{1:N} will be used to denote (X_1, ..., X_N). We define the random operators T̂_a, T̂_π and T̂ by the respective equations T̂_a = T̂_{X_{1:N},a}, T̂_π = T̂_{X_{1:N},π}, and T̂ = T̂_{X_{1:N}}. Here T̂ is called the random Bellman-operator. A great deal of effort in this paper will be devoted to showing that T̂ and its powers approximate the true Bellman-operator T and its respective powers uniformly well, with high probability.

(4) The uniform distribution is used for simplicity only. Any other sampling distribution with support covering X could be used if the algorithm is modified appropriately (importance sampling) [7]. The form of the ideal sampling distribution is far from clear, since a single sample-set is used to estimate an infinite number of integrals; it should be the subject of future research.
In order to connect the algorithm with the operators defined so far, let us introduce the projection operator P̂_{x_{1:N}} : B(X) → ℝ^N defined by P̂_{x_{1:N}}(V) = (V(x_1), ..., V(x_N)), and the expansion operators Ê_{x_{1:N},a}, Ê_{x_{1:N}} : ℝ^N → B(X) defined by the respective equations

(Ê_{x_{1:N},a} v)(x) = r(x, a) + γ Σ_{j=1}^N p̂_{x_{1:N}}(x_j|x, a) v_j,  a ∈ A, and

(Ê_{x_{1:N}} v)(x) = max_{a∈A} {(Ê_{x_{1:N},a} v)(x)}.

Finally, let the finite state-space Bellman operator L̂_{x_{1:N}} : ℝ^N → ℝ^N be defined by

(L̂_{x_{1:N}} v)_i = max_{a∈A} { r(x_i, a) + γ Σ_{j=1}^N p̂_{x_{1:N}}(x_j|x_i, a) v_j }.

The following proposition highlights the connection between the algorithm and these operators:

Proposition 3.1. For any integer t > 0,

T̂^{t+1}_{x_{1:N}} = Ê_{x_{1:N}} L̂^t_{x_{1:N}} P̂_{x_{1:N}},

and in particular,

T̂^{t+1} = Ê_{X_{1:N}} L̂^t_{X_{1:N}} P̂_{X_{1:N}}.

Proof. By inspection.

Remark 3.2. According to Proposition 3.1, one can compute (T̂^{t+1}V)(x) in two phases, the first of which we could call the off-line phase and the second of which we could call the on-line phase. In the off-line phase one computes the N-dimensional vector v^{(t)} = L̂^t_{X_{1:N}} P̂_{X_{1:N}} V, which takes O(tN^2|A|) time, whilst in the second phase one computes the value of (T̂^{t+1}V)(x) by evaluating (Ê_{X_{1:N}} v^{(t)})(x). This second
step takes O(N^2|A|) time, and thus the whole procedure takes O(tN^2|A|) time. Further, it is easy to see that the procedure takes O(N + |A|) space.(5) Now, the algorithm whose pseudo-code was given above can be formulated as follows: Assume that we are given a fixed tolerance, ε > 0. On the basis of ε and L_p, K_r, |A|, γ we compute some integer t > 0 and another integer N > 0. Each time we need to compute an action of the randomized policy π for some state x, we draw a random sample X_{1:N} and compute v^{(t)} = L̂^t_{X_{1:N}} P̂_{X_{1:N}} V_0, where V_0(x) = 0. Then a random action π(x) is computed by evaluating

argmax_{a∈A} (Ê_{X_{1:N},a} v^{(t)})(x).   (4)

The action attaining the argmax is returned. The resulting policy will be shown to be ε-optimal. Another, computationally less expensive method is to hold the random sample X_{1:N} fixed and compute v^{(t)} only once. Then the computation of π(x) using (4) costs only O(|A|N^2) steps.

4. Results

The first result that we will prove shows that the algorithm just described at the end of the previous section yields uniformly approximately optimal policies with high probability and has polynomial complexity:

Theorem 4.1. Let K = K_r/(1-γ) and let ε > 0, δ > 0, V_0 ∈ B_K(X) be fixed. Let t = t(ε, γ, K), where

t(ε, γ, K) = ⌈ (log(8K) + log(1/(ε(1-γ)))) / log(1/γ) ⌉

and let

N >= 512 K^2 K_p^2 ( 24(K+1)/(ε(1-γ)^2) )^2 ( log 8 + log(t(ε, γ, K) + 1) + log|A| + d log( 384(K+1)^2 L_p d / (ε(1-γ)^2) + 1 ) + log(1/δ) ).

Let V = T̂^t V_0 and let the stationary policy π be defined by T̂_π V = T̂V. Then P(||V^π - V*||_∞ >= ε) <= δ. Further, the complexity of the algorithm is polynomial in d, 1/ε, K, K_p, log(L_p), |A| and 1/(1-γ).

Note that ideally the bound on N should depend only on K/ε, so that scaling the rewards would not change the complexity results. The bound given in the above theorem does not have this property: it has some K's without a corresponding ε. The cause of this will become clear during the course of the proof of this theorem and, more specifically, in the proof of Lemma 5.7.
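The horizon t(ε, γ, K) of Theorem 4.1 is simply the smallest number of value-iteration sweeps that shrinks the initial gap of order 8K below ε(1-γ), i.e. the smallest t with γ^t <= ε(1-γ)/(8K). A quick sanity check (the numbers in the usage note are hypothetical, not from the paper):

```python
import math

def horizon(eps, gamma, K):
    # t(eps, gamma, K) = ceil((log(8K) + log(1/(eps*(1-gamma)))) / log(1/gamma)),
    # i.e. the smallest t with gamma**t <= eps*(1-gamma)/(8K)
    return math.ceil((math.log(8 * K) + math.log(1 / (eps * (1 - gamma))))
                     / math.log(1 / gamma))
```

For example, horizon(0.1, 0.9, 10) evaluates to 86, and indeed 0.9**86 <= 0.1*(1-0.9)/(8*10) while 0.9**85 is not.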
Note that if ε is sufficiently small, an upper bound on the above expression can always be derived by replacing K by K/ε at those occurrences of K that lack a corresponding ε term. This way one gets a less tight, but (in some sense) better behaving bound. The next result shows that the modified, fully on-line algorithm given in Table 1 yields a uniformly approximately optimal policy and has polynomial complexity. The above comments on scaling the rewards apply to this result, too.

Theorem 4.2. Let K = K_r/(1-γ) and let ε > 0. Fix some V_0 ∈ B_K(X). Let ε_1 = ε(1-γ)/(2(2+γ)), δ_1 = ε_1(1-γ)/(4K) (= ε_1(1-γ)^2/(4K_r)). Further, let t = t(ε_1/4, γ, K) and let N be the smallest integer larger than

512 K^2 K_p^2 ( 96(K+1)/(ε_1(1-γ)^3) )^2 ( log 8 + log(t + 1) + log|A| + d log( 768(K+1)^2 L_p d / (ε_1(1-γ)^3) + 1 ) + log( 4K/(ε_1(1-γ)) ) ).

Let V = T̂^t V_0 and let the stochastic stationary policy π : X × A → [0,1] be defined by π(x, a) = P(π_{X_{1:N}}(x) = a), where π_{X_{1:N}} is the policy defined by T̂V = T̂_{π_{X_{1:N}}} V. Then π is ε-optimal and, given a state x, a random action of π can be computed in time and space polynomial in K_r/ε, d, K_r, K_p, log L_p, |A| and 1/(1-γ).

(5) Here we assume that the basic algebraic operations over reals take O(1) time and that the storage of a real number takes O(1) space. We also assume that p̂_{X_{1:N}} is not stored.
The rough outline of the proofs of these theorems is as follows: Under our assumptions, Pollard's maximal inequality (cf. [10]) ensures that for any given fixed function V_0, ||T̂V_0 - TV_0||_∞ is small with high probability.(6) Using the triangle inequality, one reduces the comparison of T̂^n V_0 and T^n V_0 to those of T̂T^k V_0 and TT^k V_0, where k varies from zero to n-1. More precisely, one shows that if the differences between T̂T^k V_0 and TT^k V_0 are small for all k = 0, ..., n-1, then ||T̂^n V_0 - T^n V_0||_∞ will be small, too. Using this result, it is then easy to prove a maximal inequality for ||T̂^n V_0 - T^n V_0||_∞. Now, one can use standard contraction arguments to prove an inequality that bounds the suboptimality of a policy that is approximately greedy w.r.t. some function V in terms of the Bellman-residuals (see e.g. [4]). The plan is to use this inequality for V = T̂^n V_0 and T̂. Some more calculations yield Theorem 4.1. Then, it is proven that if a policy selects only good actions (i.e., actions from A_ε(x) = { a ∈ A : (T_a V*)(x) >= (TV*)(x) - ε } for a suitable ε), then it is good itself (i.e., close to optimal). Next, we relax the condition of selecting good actions to selecting good actions with high probability. Such policies can be shown to be good, as well (cf. Lemma 5 of [9]). Finally, it is shown that if a policy is good with high probability then it selects good actions with high probability and thus, in turn, it must be good. This will finish the proof of Theorem 4.2. One source of the complexity of the proof stems from the fact that Pollard's inequality cannot be used in a simple way to bound ||T̂^n V_0 - T^n V_0||_∞. This is because the usual induction argument that would bound ||T̂^n V_0 - T^n V_0||_∞ based on a bound on ||T̂^{n-1} V_0 - T^{n-1} V_0||_∞ does not quite work here.
Typically, one argues that if T̂ approximates T uniformly well over the space of bounded functions (or some space of functions of interest), then ||T̂^n V_0 - T^n V_0||_∞ will be small if ||T̂^{n-1} V_0 - T^{n-1} V_0||_∞ is small. Unfortunately, the space of all bounded functions is just too rich in our case: T̂ cannot approximate T uniformly well over this rather complex space. A smaller, but still appropriate space F is needed - hence the complicated proof.

(6) We must rely on Pollard's maximal inequality instead of the simpler Chernoff bounds because the state space is continuous and the sup-norm above involves a supremum over the state space. Further, this result is derived in two steps, using an idea of Rust [3].

5. Proof

We prove the theorems in the next three subsections. First, we prove some maximal inequalities for the random Bellman-operators T̂_a. Next we show how these can be extended to powers of T̂ and, finally, we apply all these to prove the main results.

5.1. Maximal Inequalities for Random Bellman Operators

We shall need some auxiliary operators which are easier to deal with using probability theory. Let T̃_a : B(X) → B(X) and T̃ : B(X) → B(X) be defined by

(T̃_a V)(x) = r(x, a) + (γ/N) Σ_{i=1}^N p(X_i|x, a) V(X_i);

(T̃V)(x) = max_{a∈A} {(T̃_a V)(x)}.

Operator T̃_a is a simple Monte-Carlo estimate of operator T_a and will be shown to converge uniformly to T_a using standard methods. Unfortunately, T̃_a is not suitable for further analysis as it can be a non-contraction, and in order to analyze the iterations in our algorithms, the contraction property of the approximate Bellman operators will be needed. Hence the algorithms use T̂_a; in a second step T̃_a will be related to T̂_a, and the approximation results will be extended to T̂_a. We need some definitions and results from the theory of uniform deviations (cf. [10]).

Definition 5.1. Let A ⊂ ℝ^d. The set S ⊂ A is an ε-cover of A if for all t ∈ A there exists an element s of S s.t. ||t - s|| <= ε. The set of ε-covers of A will be denoted C(A; ε).
Definition 5.2. The ε-covering number of a set A is defined by N(ε, A) = min{ |S| : S ∈ C(A; ε) }. The number log N(ε, A) is called the metric entropy of A.

Let z_{1:n} = (z_1, ..., z_n) ∈ (ℝ^d)^n and let F ⊂ ℝ^{ℝ^d}. We define

F(z_{1:n}) = { (f(z_1), ..., f(z_n)) : f ∈ F } ⊂ ℝ^n.   (5)

The following theorem is due to Pollard (see [10]):

Theorem 5.3 (Pollard, 1984). Let n > 0 be an integer, ε > 0, M > 0, and let F ⊂ [0, M]^{ℝ^d} be a set of measurable functions. Let X_1, ..., X_n ∈ ℝ^d be i.i.d. random variables. Then

P( sup_{f∈F} | (1/n) Σ_{i=1}^n f(X_i) - E[f(X_1)] | > ε ) <= 8 E[ N(ε/8, F(X_{1:n})) ] e^{-nε²/(128M²)}.   (6)

An elegant proof of this theorem can be found in [5, pp. 492]. In general, some further assumptions are needed to make the above sup measurable. Measurability problems, however, are now well understood, so we shall not worry about this detail. Readers who keep worrying should take all the probability bounds except for the main result as outer/inner-probability bounds (whichever is appropriate). Note that in the final result we work with measurable sets and therefore there is no need to refer to outer/inner probability measures. Firstly, we extend this theorem to functions mapping ℝ^d into [-M, M].

Corollary 5.3.1. Let n > 0 be an integer, ε > 0, M > 0, and let F ⊂ [-M, M]^{ℝ^d} be a set of measurable functions. Let X_1, ..., X_n ∈ ℝ^d be i.i.d. random variables. Then

P( sup_{f∈F} | (1/n) Σ_{i=1}^n f(X_i) - E[f(X_1)] | > ε ) <= 8 E[ N(ε/8, F(X_{1:n})) ] e^{-nε²/(512M²)}.   (7)

Proof. Apply Theorem 5.3 to f_M = f + M.

Definition 5.4. Let d ∈ ℕ, d > 0, and let σ > 0. Let

Grid(σ) = { (2i_1σ, ..., 2i_dσ) ∈ [0,1]^d : 0 <= i_k, 1 <= k <= d, i_k <= 1/(2σ) }

and let P_σ : [0,1]^d → Grid(σ) be defined by P_σx = argmin_y { ||x - y|| : y ∈ Grid(σ) }, where ties are broken in favor of points having smaller coordinates.

Remark 5.5. ||x - P_σx||_1 <= dσ and |Grid(σ)| <= (1/(2σ) + 1)^d.

Now we can prove our first result concerning the approximation of T_a by T̃_a.

Lemma 5.6. Let K > 0, ε > 0 and δ > 0.
Further, let B_0 ⊂ B_K(X) be a finite set,

p_1(d, ε, δ, K, K_p, L_p, |B_0|, |A|, γ) = (512 K^2 K_p^2 / ε^2) ( log 8 + log|B_0| + log|A| + d log( 16 K L_p d / ε + 1 ) + log(1/δ) )   (8)

and N >= p_1(d, ε, δ, K, K_p, L_p, |B_0|, |A|, γ). Then

P( max_{V∈B_0} max_{a∈A} ||T̃_a V - T_a V||_∞ > ε ) <= δ.   (9)

Proof. We shall make use of Corollary 5.3.1. Let F(X_{1:N}) = { z(V, x, a) : V ∈ B_0, a ∈ A, x ∈ X }, where z(V, x, a) = (V(X_1)p(X_1|x, a), ..., V(X_N)p(X_N|x, a)). Easily, the components of z(V, x, a) lie in [-KK_p, KK_p]. In order to bound N(ε, F(X_{1:N})) from above, we construct an ε-cover of F(X_{1:N}). We claim that S_σ = { z(V, x, a) : V ∈ B_0, a ∈ A, x ∈ Grid(σ) } is an ε-cover of F(X_{1:N}) if σ is chosen appropriately. In order to prove this, let us pick an arbitrary element z(V, x, a) of F(X_{1:N}). Then
||z(V, x, a) - z(V, P_σx, a)||_∞ = max_{1<=i<=N} | V(X_i)p(X_i|x, a) - V(X_i)p(X_i|P_σx, a) | <= ||V||_∞ max_{1<=i<=N} | p(X_i|x, a) - p(X_i|P_σx, a) | <= K L_p d σ.

Therefore, if σ = ε/(K L_p d) then S_σ is an ε-cover of F(X_{1:N}). By Remark 5.5, N(ε, F(X_{1:N})) <= (K L_p d/(2ε) + 1)^d |B_0| |A|. By Corollary 5.3.1, if

N >= (512 K^2 K_p^2 / ε^2) ( log 8 + log|B_0| + log|A| + d log( 16 K L_p d / ε + 1 ) + log(1/δ) )

then (9) holds.

Now, we shall prove a similar result for T̂_a, using ideas from the proof of the Corollary to Theorem 3.4 of [3].

Lemma 5.7. Let K > 0, ε > 0 and δ > 0. Further, let B_0 ⊂ B_K(X) be a finite set,

p_2(d, ε, δ, K, K_p, L_p, |B_0|, |A|, γ) = 512 K^2 K_p^2 ( (K+1)/ε )^2 ( log 8 + log(|B_0| + 1) + log|A| + d log( 16(K+1)^2 L_p d / ε + 1 ) + log(1/δ) ).   (10)

If N >= p_2(d, ε, δ, K, K_p, L_p, |B_0|, |A|, γ) then

P( max_{V∈B_0} max_{a∈A} ||T̂_a V - T_a V||_∞ > ε ) <= δ.   (11)

Proof. Let us pick some V ∈ B_0. By the triangle inequality,

||T̂_a V - T_a V||_∞ <= ||T̂_a V - T̃_a V||_∞ + ||T̃_a V - T_a V||_∞.   (12)

Let

p̄_N(x, a) = (1/N) Σ_{i=1}^N p(X_i|x, a).

If p̄_N(x, a) = 0 then (T̂_a V)(x) - (T̃_a V)(x) = 0. If p̄_N(x, a) ≠ 0 then by simple algebraic manipulations we get

(T̂_a V)(x) - (T̃_a V)(x) = γ (1 - p̄_N(x, a))/p̄_N(x, a) · (1/N) Σ_{i=1}^N p(X_i|x, a) V(X_i).

Since, by assumption, ||V||_∞ <= K, we have

| (T̂_a V)(x) - (T̃_a V)(x) | <= γ K | 1 - p̄_N(x, a) |.   (13)

Let e : X → ℝ be defined by e(x) = 1 and observe that γ(1 - p̄_N(x, a)) = (T_a e)(x) - (T̃_a e)(x), and therefore by (13) we have

| (T̂_a V)(x) - (T̃_a V)(x) | <= K | (T_a e)(x) - (T̃_a e)(x) |.

Note that this inequality holds also when p̄_N(x, a) = 0. Taking the supremum over X yields ||T̂_a V - T̃_a V||_∞ <= K ||T_a e - T̃_a e||_∞. By (12) we have

||T̂_a V - T_a V||_∞ <= K ||T_a e - T̃_a e||_∞ + ||T̃_a V - T_a V||_∞ <= (K+1) max_{V∈B_0∪{e}} ||T̃_a V - T_a V||_∞.

Therefore

max_{V∈B_0} max_{a∈A} ||T̂_a V - T_a V||_∞ <= (K+1) max_{V∈B_0∪{e}} max_{a∈A} ||T̃_a V - T_a V||_∞.

Now, the statement of the lemma follows using Lemma 5.6 with the choice p_1(d, ε/(K+1), δ, K+1, K_p, L_p, |B_0|+1, |A|, γ).
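The grid of Definition 5.4 and the two claims of Remark 5.5 are easy to check numerically. The following sketch is illustrative only (a brute-force nearest-point projection standing in for P_σ); the function names are assumptions for the example.

```python
import itertools, math

def grid(sigma, d):
    # Grid(sigma): the points (2*i_1*sigma, ..., 2*i_d*sigma) in [0,1]^d
    n = math.floor(1.0 / (2.0 * sigma))          # largest i_k with 2*i_k*sigma <= 1
    axis = [2 * i * sigma for i in range(n + 1)]
    return list(itertools.product(axis, repeat=d))

def project(x, points):
    # P_sigma x: a nearest grid point; ties go to points with smaller coordinates
    # because itertools.product enumerates them first and min() keeps the first minimum
    return min(points, key=lambda g: max(abs(a - b) for a, b in zip(x, g)))
```

For σ = 0.25 and d = 2 the grid has 9 <= (1/(2σ)+1)^2 points, each coordinate of any x ∈ [0,1]^2 is within σ of its projection, and hence ||x - P_σx||_1 <= dσ.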
5.2. Maximal Inequalities for Powers of Random Bellman Operators

First we need a proposition that relates the fixed point of a contraction operator to an operator that is approximating the contraction.

Proposition 5.8. Let B be a space of bounded functions(7), and fix some V ∈ B and integer t > 0. Let T_1, T_2 : B → B be operators on B such that T_1 ∈ Lip(γ) for some 0 <= γ < 1 and

||T_1 T_2^s V - T_2 T_2^s V|| <= α,  0 <= s <= t-1   (14)

for some α > 0. Then

||T_1^t V - T_2^t V|| <= α/(1-γ).   (15)

Proof. We prove the statement by induction; namely, we prove that

||T_1^s V - T_2^s V|| <= α/(1-γ)   (16)

holds for all 0 <= s <= t. The statement is obvious for s = 0. Assume that we have already proven (16) for s-1. By the triangle inequality,

||T_1^s V - T_2^s V|| <= ||T_1 T_1^{s-1} V - T_1 T_2^{s-1} V|| + ||T_1 T_2^{s-1} V - T_2 T_2^{s-1} V||.

Since T_1 ∈ Lip(γ), the first term of the rhs can be bounded by γ ||T_1^{s-1} V - T_2^{s-1} V||, which in turn can be bounded by γα/(1-γ), by the induction hypothesis. The second term, on the other hand, can be bounded by α, by (14). Since γα/(1-γ) + α = α/(1-γ), inequality (16) holds for s as well, thus proving the proposition.

We cite the next proposition without proof, as the proof is both elementary and well known.

Proposition 5.9. Let K = K_r/(1-γ). Then the Bellman-operator T maps B_K(X) into B_K(X).

Now follows the main result of this section.

Lemma 5.10. Let t > 0 be an integer, ε > 0, δ > 0, K = K_r/(1-γ), V_0 ∈ B_K(X). Let

p_3(d, ε, δ, K, K_p, L_p, |B_0|, |A|, γ) = 512 K^2 K_p^2 ( (K+1)/(ε(1-γ)) )^2 ( log 8 + log(|B_0| + 1) + log|A| + d log( 16(K+1)^2 L_p d / (ε(1-γ)) + 1 ) + log(1/δ) ).

If N >= p_3(d, ε, δ, K, K_p, L_p, |B_0|, |A|, γ) then

P( max{ max_{a∈A} ||T̂_a T^t V_0 - T_a T^t V_0||_∞, ||T̂^t V_0 - T^t V_0||_∞ } >= ε ) <= δ.   (17)

Proof. Let V_s = T^s V_0, 0 <= s <= t, and B_0 = {V_0, V_1, ..., V_t}. By Proposition 5.9, B_0 ⊂ B_K(X). By Lemma 5.7, if N >= p_2(d, ε(1-γ), δ, K, K_p, L_p, |B_0|, |A|, γ) then

P( max_{V∈B_0} max_{a∈A} ||T̂_a V - T_a V||_∞ >= ε(1-γ) ) <= δ.

Let the elementary random event ω be such that

max_{V∈B_0} max_{a∈A} ||T̂_a(ω) V - T_a V||_∞ <= ε(1-γ).

(7) More generally, B could be any Banach-space.
If we show that

max{ max_{a∈A} ||T̂_a(ω) T^t V_0 - T_a T^t V_0||_∞, ||T̂(ω)^t V_0 - T^t V_0||_∞ } <= ε   (18)

then the proof will be finished. Obviously,

max_{a∈A} ||T̂_a(ω) T^t V_0 - T_a T^t V_0||_∞ <= ε   (19)

by the construction of B_0 and since ε(1-γ) <= ε. Now, note that
10 0 ˆT ω)v T V max ˆT a ω)v T a V holds for all V BX ). Since by the choice of ω and, max ˆT a ω)t s V 0 T a T s V 0 ε γ), 0 s t, we also have ˆT ω)t s V 0 T T s V 0 ε γ), 0 s t. Moreover, since ˆT ω) Lip γ), Proposition 5.8 can be applied with the choice B = BX ), T = ˆT ω), T 2 = T and V = V 0, yielding ˆT ω) t V 0 T t V 0 ε. This together with 9) yields 8), thus proving the theorem Proving the ε-optimality of the Algorithm First, we prove an inequality similar to that of [4], but here we use both approximate value functions and approximate operators. Lemma 5.. Let V BX ), x : X for some > 0 and let π : X A be such that Then V π V 2 γ ˆT x : πv = ˆT x : V. max T a V ˆT x : av + γ T V V ). 20) ote that since A is finite, the policy defined in the lemma exists. Proof. We compare Tπ k V and T k V since these are known to converge to V π and V, respectively. Firstly, we write the difference Tπ k V T k V in the form of a telescoping sum: k Tπ k V T k V = T i+ π V TπV i ) i= k + T π V T V ) T i+ V T i V ). i= Using the triangle inequality, the relations T, T π Lip γ), and the inequality γ k + γ k γ γ/ γ), we get T k π V T k V γ T π V V γ + T V V ) + T π V T V. Using the identity ˆT x : πv = ˆT x : V, we write T π V T V = and thus 2) T π V ˆT ) ) x : πv + ˆTx : V T V T π V T V T π V ˆT x : πv + ˆT x : V T V 2 max T a V ˆT x : av. 22) On the other hand, T π V V T π V T V + T V V, and therefore by 2), T k π V T k V 2γ γ T V V ) γ + γ + T π V T V which combined with 22) yields T k π V T k V 2 γ T V V γ + max T a V ˆT x : av ). Taking the limes superior of both sides when k yields the lemma.
Note that if T̂_{x_{1:N},a} = T_a then we recover the tight bounds of [4].(8) The next lemma exploits the fact that if V_t = T̂_{x_{1:N}}^t V_0 for some V_0 ∈ B_K(X), then the Bellman-error ||TV_t - V_t||_∞ can be related to the quality of approximation of T_a by T̂_{x_{1:N},a}.

Lemma 5.12. Let K = K_r/(1-γ), ε > 0 and let V_0 ∈ B_K(X) be fixed. Let

t = t(ε, γ, K) = ⌈ (log(8K) + log(1/(ε(1-γ)))) / log(1/γ) ⌉,

let x_{1:N} ∈ X^N, V_t = T̂_{x_{1:N}}^t V_0, and assume that

max_{a∈A} ||T_aV_t - T̂_{x_{1:N},a}V_t||_∞ <= ε(1-γ)/(4(1+γ)).   (23)

Further, let π : X → A be s.t. T̂_{x_{1:N},π}V_t = T̂_{x_{1:N}}V_t. Then π is ε-optimal, i.e., ||V^π - V*||_∞ <= ε.

Proof. We use Lemma 5.11 with V = V_t. Let us bound the Bellman-error ||TV_t - V_t||_∞ first:

||TV_t - V_t||_∞ <= ||TV_t - T̂_{x_{1:N}}V_t||_∞ + ||T̂_{x_{1:N}}V_t - V_t||_∞ <= max_{a∈A} ||T_aV_t - T̂_{x_{1:N},a}V_t||_∞ + ||T̂_{x_{1:N}}^{t+1}V_0 - T̂_{x_{1:N}}^tV_0||_∞.

Since T̂_{x_{1:N}} ∈ Lip(γ), the second term is bounded by γ^t ||T̂_{x_{1:N}}V_0 - V_0||_∞ <= γ^t ( ||T̂_{x_{1:N}}V_0||_∞ + ||V_0||_∞ ) <= 2Kγ^t, where we have used that T̂_{x_{1:N}} : B_K(X) → B_K(X) and V_0 ∈ B_K(X). Therefore, by Lemma 5.11 we have

||V^π - V*||_∞ <= (2(1+γ)/(1-γ)) max_{a∈A} ||T_aV_t - T̂_{x_{1:N},a}V_t||_∞ + 4Kγ^{t+1}/(1-γ).

Using the definition of t and (23) we get ||V^π - V*||_∞ <= ε, proving the lemma.

(8) Note that the lemma still holds if we replace the special operators T̂_{x_{1:N},a}, T̂_{x_{1:N},π} and T̂_{x_{1:N}} by operators T̂_a, T̂_π, T̂ ∈ Lip(γ) satisfying (T̂_πV)(x) = (T̂_{π(x)}V)(x) and (T̂V)(x) = max_{a∈A} (T̂_aV)(x).

Now, we are in a position to prove the first main result, stated as Theorem 4.1 before:

Theorem 5.13. Let K = K_r/(1-γ) and let ε > 0, δ > 0, V_0 ∈ B_K(X) be fixed. Let t = t(ε, γ, K) and

p_4(d, ε, δ, K, K_p, L_p, |A|, γ) = 512 K^2 K_p^2 ( 24(K+1)/(ε(1-γ)^2) )^2 ( log 8 + log(t(ε, γ, K) + 1) + log|A| + d log( 384(K+1)^2 L_p d / (ε(1-γ)^2) + 1 ) + log(1/δ) ).   (24)

Let N >= p_4(d, ε, δ, K, K_p, L_p, |A|, γ). Let V = T̂^tV_0 and let the stationary policy π be defined by T̂_πV = T̂V. Then

P( ||V^π - V*||_∞ >= ε ) <= δ.   (25)

Proof. The proof combines Lemmas 5.12 and 5.10. Firstly, we bound m = max_{a∈A} ||T̂_aT̂^tV_0 - T_aT̂^tV_0||_∞. Let π : X → A be defined by

π(x) = argmax_{a∈A} ||T̂_aT̂^tV_0 - T_aT̂^tV_0||_∞

(π does not depend on x). Then
m = ||T̂_πT̂^tV_0 - T_πT̂^tV_0||_∞ <= ||T̂_πT̂^tV_0 - T̂_πT^tV_0||_∞ + ||T̂_πT^tV_0 - T_πT^tV_0||_∞ + ||T_πT^tV_0 - T_πT̂^tV_0||_∞ <= 2γ ||T̂^tV_0 - T^tV_0||_∞ + max_{a∈A} ||T̂_aT^tV_0 - T_aT^tV_0||_∞ <= (2γ+1) max{ max_{a∈A} ||T̂_aT^tV_0 - T_aT^tV_0||_∞, ||T̂^tV_0 - T^tV_0||_∞ }.

Therefore, if

N >= p_3(d, ε(1-γ)/(4(2γ+1)(γ+1)), δ, K, K_p, L_p, t(ε, γ, K), |A|, γ)

then by Lemma 5.10 and Lemma 5.12, ||V^π - V*||_∞ <= ε with probability at least 1-δ.

In order to finish the proof of the main theorem, we will prove that in discounted problems stochastic policies that generate ε-optimal actions with high probability are uniformly good. This result appears in the context of finite models in [9]. For completeness, we present the proof here. We start with the definition of ε-optimal actions and then prove three simple lemmas.

Definition 5.14. Let ε > 0, and consider a discounted MDP (X, A, p, r, γ). We call

A_ε(x) = { a ∈ A : (T_aV*)(x) >= (TV*)(x) - ε }

the set of ε-optimal actions. Elements of this set are called ε-optimal.

Lemma 5.15. Let π : X × A → [0,1] be a stationary stochastic policy that selects only ε-optimal actions: for all x ∈ X and a ∈ A, π(x, a) > 0 implies a ∈ A_ε(x). Then ||V^π - V*||_∞ <= ε/(1-γ).

Proof. From the definition of π it is immediate that ||T_πV* - V*||_∞ <= ε. Indeed, T_πV* <= V* and

(T_πV*)(x) = Σ_{a∈A} π(x, a)(T_aV*)(x) = Σ_{a∈A_ε(x)} π(x, a)(T_aV*)(x) >= Σ_{a∈A_ε(x)} π(x, a)((TV*)(x) - ε) = V*(x) - ε.

Now, consider the telescoping sum

T_π^kV* = T_πV* + Σ_{i=1}^{k-1} (T_π^{i+1}V* - T_π^iV*).

Therefore,

||T_π^kV* - V*||_∞ <= ||T_πV* - V*||_∞ + Σ_{i=1}^{k-1} ||T_π^{i+1}V* - T_π^iV*||_∞ <= ε + (γ/(1-γ)) ε = ε/(1-γ).

Letting k → ∞ yields ||V^π - V*||_∞ <= ε/(1-γ).

The next lemma will be applied to show that if two policies are close to each other then so are their evaluation functions. Both the lemma and its proof are very similar to those of Proposition 5.8.

Lemma 5.16. Let B be a space of bounded functions(9), and B_K = { V ∈ B : ||V|| <= K }. Assume that T_1, T_2 : B_K → B_K are such that for some α > 0, ||T_1V - T_2V|| <= α holds for all V ∈ B_K, and T_1 ∈ Lip(γ) for some 0 <= γ < 1. Then ||T_1^sV - T_2^sV|| <= α/(1-γ). Further, let V_1* be the fixed point of T_1 and V_2* the fixed point of T_2. If T_2 ∈ Lip(γ) then ||V_1* - V_2*|| <= α/(1-γ).

Proof.
The proof is almost identical to that of Proposition 5.8. One proves by induction that ‖T_1^s V − T_2^s V‖ ≤ α/(1−γ) holds for all s ≥ 0; here V ∈ B_K is fixed. Indeed, the inequality holds for s = 0. Assuming that it holds for s−1 with s ≥ 1, one gets

⁹ Again, B could be any Banach space.
    ‖T_1^s V − T_2^s V‖ ≤ ‖T_1 T_1^{s−1} V − T_1 T_2^{s−1} V‖ + ‖T_1 T_2^{s−1} V − T_2 T_2^{s−1} V‖ ≤ γα/(1−γ) + α = α/(1−γ),

showing the first part of the statement. The second part is proven by taking the limes superior of both sides as s → ∞.

Now we are ready to prove the lemma showing that policies that choose ε-optimal actions with high probability are uniformly good.

Lemma 5.7. Let ε > 0 and 1 > δ > 0 be given. Let π : X × A → [0,1] be a stochastic policy that selects ε-optimal actions with probability at least 1−δ. Then ‖V^π − V*‖ ≤ (ε + 2Kδ)/(1−γ).

Proof. Let δ(x) = Σ_{a ∉ A_ε(x)} π(x,a) denote the probability of selecting non-ε-optimal actions in state x (x ∈ X). By assumption, δ(x) ≤ δ < 1. Let π′ : X × A → [0,1] be the policy defined by

    π′(x,a) = π(x,a)/(1−δ(x)), if a ∈ A_ε(x),
    π′(x,a) = 0,               otherwise.

We claim that T_π and T_π′ are close to each other. For, let V ∈ B_K(X), where K = K_r/(1−γ). Then

    (T_π V)(x) − (T_π′ V)(x) = Σ_a (π(x,a) − π′(x,a)) (T_a V)(x),

and since ‖T_a V‖ ≤ K,

    ‖T_π V − T_π′ V‖ ≤ K sup_x Σ_a |π(x,a) − π′(x,a)|.

Further,

    Σ_a |π(x,a) − π′(x,a)| = Σ_{a ∈ A_ε(x)} |π(x,a) − π′(x,a)| + Σ_{a ∉ A_ε(x)} π(x,a)
                           = Σ_{a ∈ A_ε(x)} π(x,a) δ(x)/(1−δ(x)) + δ(x) = 2δ(x) ≤ 2δ.

Therefore, ‖T_π V − T_π′ V‖ ≤ 2Kδ. Since T_π, T_π′ and B_K(X) satisfy the assumptions of Lemma 5.6, and the fixed points of T_π and T_π′ are V^π and V^π′, respectively, we have

    ‖V^π − V^π′‖ ≤ 2Kδ/(1−γ).    (26)

Further, by construction π′ selects only ε-optimal actions, and thus by Lemma 5.5, ‖V^π′ − V*‖ ≤ ε/(1−γ). Combining this with (26), we get ‖V^π − V*‖ ≤ (ε + 2Kδ)/(1−γ), finishing the proof.

We are now ready to prove the main result of the paper, stated earlier as Theorem 4.2:

Theorem 5.8. Let K = K_r/(1−γ) and let ε > 0. Fix some V_0 ∈ B_K(X). Let ε₁ = ε(1−γ)/(2(1+γ)) and δ₁ = ε(1−γ)/(4K) (= ε(1−γ)²/(4K_r)). Further, let

    t = t(ε₁, γ, K) = ⌈log(32K/(ε(1−γ)²)) / log(1/γ)⌉    (27)

and let

    p_5(d, ε, K, K_p, L_p, |A|, γ) = 512 K² K_p² (96(K+1)/(ε(1−γ)³))² (log 8 + log(t+1) + log|A| + d log(768(K+1)² L_p/(ε(1−γ)³) + 1) + log(4K/(ε(1−γ)))).    (28)

Choose

    N ≥ p_5(d, ε, K, K_p, L_p, |A|, γ)    (29)

and let V = T̂^t_{X_{1:N}} V_0.
Further, let the stochastic stationary policy π : X × A → [0,1] be defined by

    π(x,a) = P(π_{X_{1:N}}(x) = a),    (30)

where π_{X_{1:N}} is the policy defined by T̂^{π_{X_{1:N}}}_{X_{1:N}} V = T̂_{X_{1:N}} V. Then π is ε-optimal and, given a state x, a random action of π can be computed in time and space polynomial in 1/ε, d, K, log L_p, |A| and 1/(1−γ).
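Before turning to the proof, the planning scheme behind Theorem 5.8 can be sketched in code: draw the base points X_{1:N}, run t steps of value iteration with the row-normalized sampled Bellman operator T̂_{X_{1:N}}, and act greedily at the query state. This is only an illustrative sketch, not the paper's implementation; the `reward`, `density`, and `sample_states` callables below are hypothetical stand-ins for the MDP's r and p.

```python
import numpy as np

def plan_and_act(x, sample_states, reward, density, actions, gamma, t, rng):
    """One query of the Monte-Carlo planner: draw base points X_{1:N},
    run t steps of value iteration with the sampled (normalized) Bellman
    operator, then return an action that is greedy at the query state x.
    `reward`, `density`, and `sample_states` are hypothetical stand-ins."""
    X = sample_states(rng)                      # base points X_1, ..., X_N
    R = {a: np.array([reward(xi, a) for xi in X]) for a in actions}
    W = {}                                      # row-normalized sampled kernels
    for a in actions:
        P = np.array([[density(xj, xi, a) for xj in X] for xi in X])
        W[a] = P / P.sum(axis=1, keepdims=True)
    V = np.zeros(len(X))
    for _ in range(t):                          # V <- (T-hat) V
        V = np.max([R[a] + gamma * W[a] @ V for a in actions], axis=0)
    # Greedy action at the query state, reusing the same sample set.
    def q(a):
        w = np.array([density(xj, x, a) for xj in X])
        return reward(x, a) + gamma * (w / w.sum()) @ V
    return max(actions, key=q)
```

With |A| actions and N base points, one backup in this sketch costs O(N²|A|) operations, since every base point is backed up through every other base point for every action.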
Proof. The second part of the statement is immediate (cf. Remark 3.2): the bound on the computation time is

    O((t+1) N² |A|)    (31)

and the space requirement of the algorithm is¹⁰

    O(N(1 + |A|)).    (32)

¹⁰ Assuming that only the normalization factors of the transition probabilities p̂_{X_{1:N}} are stored.

For the first part, fix X_{1:N}. By Theorem 5.3, V = T̂^t_{X_{1:N}} V_0 satisfies

    P(‖V − V*‖ > ε₁) ≤ δ₁.

We claim that if ω is such that ‖V(ω) − V*‖ ≤ ε₁ then π_{X_{1:N}}(ω)(x) ∈ A_{(1+γ)ε₁}(x). Let us pick such an ω and let T̃ = T_{π_{X_{1:N}}(ω)}; note that V = T̃ V. Then

    ‖T̃ V* − V*‖ ≤ ‖T̃ V* − T̃ V‖ + ‖T̃ V − V*‖ ≤ γ ‖V* − V‖ + ‖V − V*‖ ≤ (1+γ) ε₁.

Therefore, using the definition of A_ε(x), we get that π_{X_{1:N}}(ω)(x) ∈ A_{(1+γ)ε₁}(x). This shows that

    P(π_{X_{1:N}}(x) ∈ A_{(1+γ)ε₁}(x)) ≥ 1 − δ₁.

Now, by Lemma 5.7, the policy π defined by (30) is ((1+γ)ε₁ + 2Kδ₁)/(1−γ)-optimal, i.e.,

    ‖V^π − V*‖ ≤ ((1+γ)ε₁ + 2Kδ₁)/(1−γ).

Substituting the definitions of ε₁ and δ₁ yields the result.

6. Conclusions and Further Work

In this article we have considered an on-line planning algorithm that was shown to avoid the curse of dimensionality. Bounds following from Rust's original result by Markov's inequality were improved upon in several ways: our bounds depend poly-logarithmically on the Lipschitz constant of the transition probabilities, they do not depend on the Lipschitz constant of the immediate rewards (we dropped the assumption of Lipschitz-continuous immediate reward functions), and the number of samples depends on the cardinality of the action set only poly-logarithmically as well.

It is interesting to note that although our bounds depend poly-logarithmically on the Lipschitz constant of the transition probabilities (characterizing how fast the dynamics is), they depend polynomially on the bound of the transition probabilities (characterizing the randomness of the MDP). Therefore, perhaps not surprisingly, for this kind of Monte-Carlo algorithm faster dynamics are easier to cope with than less random dynamics (with peaky transition probability functions). As a consequence of our result, many interesting questions arise.
For example, different variants of the proposed algorithm could be compared, such as multigrid versions, versions using quasi-random numbers, or versions that use importance sampling. In practice, one would probably choose not to recompute the cache C_ε for each query. Also, in practice, one would probably precompute the transition probability table p̂_{X_{1:N}}(X_i | X_j, a) and, in order to speed up the iterations, eliminate the computation with those transition probability values that are very close to zero. This would considerably speed up the computations, as one would expect distant parts of the state space to be uncoupled. However, the theoretical effect of these modifications needs to be explored.

Note that the Lipschitz condition on p can be replaced by an appropriate condition on the metric entropy of p(·|x,a) and the proofs will still go through. Therefore, the proofs can be extended to Hölder classes of transition laws or local Lipschitz classes (e.g. ‖p(·|x₁,a) − p(·|x₂,a)‖ ≤ L(x₁,a) ‖x₁ − x₂‖; in this case one would need to use bracketing numbers), smooth functions, Sobolev classes, etc.

One of the most interesting problems is to extend the results to infinite action spaces. For sure, such an extension needs some regularity assumptions on the dependence of the transition probability law and the reward function on the actions. It would also be interesting to prove analogous results for discrete MDPs having a factorized representation. The presented algorithm may find applications in economic problems without any modifications [12]. We also work on applications to deterministic continuous state-space, finite-action-space control problems and to partially observable MDPs over discrete spaces. Also, a combination with look-ahead search can be interesting from the practical point of view.

The algorithm considered in the article was tried in practice on some standard problems (car-on-the-hill, acrobot) and it was observed to yield reasonable performance even when the number of samples was kept quite small (in the range of a few hundred to a few thousand samples). It was also observed that boundary effects can interfere negatively with the algorithm. Details of these experiments, however, will be described elsewhere.

References

[1] R. Bellman. Dynamic Programming. Princeton University Press, Princeton, New Jersey, 1957.
[2] D. P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, Englewood Cliffs, NJ, USA, 1989.
[3] C. S. Chow and J. N. Tsitsiklis. The complexity of dynamic programming. Journal of Complexity, 5:466-488, 1989.
[4] C. S. Chow and J. N. Tsitsiklis. An optimal multigrid algorithm for continuous state discrete time stochastic control. IEEE Transactions on Automatic Control, 36(8):898-914, 1991.
[5] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Applications of Mathematics: Stochastic Modelling and Applied Probability. Springer-Verlag, New York, 1996.
[6] E. B. Dynkin and A. A. Yushkevich. Controlled Markov Processes. Springer-Verlag, Berlin, 1979.
[7] G. S. Fishman. Monte Carlo: Concepts, Algorithms, and Applications. Springer-Verlag, 1999.
[8] M. Kearns, Y. Mansour, and A. Y. Ng. Approximate planning in large POMDPs via reusable trajectories. In S. A. Solla, T. K. Leen, and K. R. Müller, editors, Advances in Neural Information Processing Systems 12. MIT Press, Cambridge, MA, 1999. To appear.
[9] M. Kearns, Y. Mansour, and A. Y. Ng. A sparse sampling algorithm for near-optimal planning in large Markovian decision processes. In Proceedings of IJCAI 99, 1999.
[10] D. Pollard. Convergence of Stochastic Processes. Springer-Verlag, New York, 1984.
[11] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, 1994.
[12] J. Rust. Structural estimation of Markov decision processes. In Handbook of Econometrics, volume 4, chapter 51. North-Holland, 1994.
[13] J. Rust. Using randomization to break the curse of dimensionality. Econometrica, 65:487-516, 1996.
[14] R. J. Williams and L. C. Baird, III. Tight performance bounds on greedy policies based on imperfect value functions. In Proceedings of the Tenth Yale Workshop on Adaptive and Learning Systems, 1994.
An Analysis of Model-Based Interval Estimation for Markov Decision Processes Alexander L. Strehl, Michael L. Littman astrehl@gmail.com, mlittman@cs.rutgers.edu Computer Science Dept. Rutgers University
More informationOptimal Control. McGill COMP 765 Oct 3 rd, 2017
Optimal Control McGill COMP 765 Oct 3 rd, 2017 Classical Control Quiz Question 1: Can a PID controller be used to balance an inverted pendulum: A) That starts upright? B) That must be swung-up (perhaps
More informationStochastic Safest and Shortest Path Problems
Stochastic Safest and Shortest Path Problems Florent Teichteil-Königsbuch AAAI-12, Toronto, Canada July 24-26, 2012 Path optimization under probabilistic uncertainties Problems coming to searching for
More informationOn the static assignment to parallel servers
On the static assignment to parallel servers Ger Koole Vrije Universiteit Faculty of Mathematics and Computer Science De Boelelaan 1081a, 1081 HV Amsterdam The Netherlands Email: koole@cs.vu.nl, Url: www.cs.vu.nl/
More informationSample width for multi-category classifiers
R u t c o r Research R e p o r t Sample width for multi-category classifiers Martin Anthony a Joel Ratsaby b RRR 29-2012, November 2012 RUTCOR Rutgers Center for Operations Research Rutgers University
More informationOn Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:
A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition
More informationOn Finding Optimal Policies for Markovian Decision Processes Using Simulation
On Finding Optimal Policies for Markovian Decision Processes Using Simulation Apostolos N. Burnetas Case Western Reserve University Michael N. Katehakis Rutgers University February 1995 Abstract A simulation
More informationMarkov Decision Processes and Dynamic Programming
Master MVA: Reinforcement Learning Lecture: 2 Markov Decision Processes and Dnamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of
More information6 Reinforcement Learning
6 Reinforcement Learning As discussed above, a basic form of supervised learning is function approximation, relating input vectors to output vectors, or, more generally, finding density functions p(y,
More informationHilbert spaces. 1. Cauchy-Schwarz-Bunyakowsky inequality
(October 29, 2016) Hilbert spaces Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/fun/notes 2016-17/03 hsp.pdf] Hilbert spaces are
More information