Efficient Approximate Planning in Continuous Space Markovian Decision Problems

Efficient Approximate Planning in Continuous Space Markovian Decision Problems

Csaba Szepesvári, Mindmaker Ltd., Konkoly-Thege M. u., Budapest, HUNGARY. E-mail: szepes@mindmaker.hu

Monte-Carlo planning algorithms for planning in continuous state-space, discounted Markovian Decision Problems (MDPs) having a smooth transition law and a finite action space are considered. We prove various polynomial complexity results for the considered algorithms, improving upon several known bounds.

Keywords: Markovian Decision Problems, planning, value iteration, Monte-Carlo algorithms

1. Introduction

MDPs provide a clean and simple, yet fairly rich framework for studying various aspects of intelligence, such as planning. A well-known practical limitation of planning in MDPs is the so-called curse of dimensionality [1], referring to the exponential rise in the resources required to compute even approximate solutions to an MDP as the size of the MDP (the number of state variables) increases. For example, conventional dynamic programming (DP) algorithms, such as value- or policy-iteration, scale exponentially with the size, even if they are combined with sophisticated multigrid algorithms [4]. Moreover, the curse of dimensionality is not specific to any particular algorithm, as shown by a result of Chow and Tsitsiklis [3].

Recently, Kearns et al. have shown that a certain on-line, tree-building algorithm avoids the curse of dimensionality in discounted MDPs [9]. This result has since been extended to partially observable MDPs (POMDPs) by the same authors [8]. The bounds in these two papers are independent of the size of the state space,^1 but scale exponentially with 1/(1−γ), the effective horizon-time, where γ is the discount factor of the MDP.

In this paper we consider another on-line planning algorithm that will be shown to scale polynomially with the horizon-time as well. The price of this is that we have to assume more regularity of the MDPs we consider. In particular, we restrict ourselves to stochastic MDPs with finite action spaces and state space X = [0,1]^d and, more importantly, assume that the transition probability kernel of the MDP satisfies the Lipschitz condition

  |p(x|x_1, a) − p(x|x_2, a)| ≤ L_p ‖x_1 − x_2‖_1

for any states x, x_1, x_2 ∈ [0,1]^d and action a ∈ A. Here L_p > 0 is a given fixed number and ‖·‖_1 denotes the l_1 norm of vectors. Another restriction, quite common in the literature, that we assume is the uniform boundedness of the transition probabilities (the bound shall be denoted by K_p) and of the immediate rewards (bound denoted by K_r). Further, our bounds will depend on the dimension of the state space, d.

The idea of the considered algorithms originates in the algorithm studied by Rust [13].^2 Rust considered a more restricted class of problems than the one treated in this paper and proved the following result. First, let us define the concept of ε-optimality in the mean. Fix an MDP with state space X. A random, real-valued function V̂ with domain X is called ε-optimal in the mean if E[‖V̂ − V*‖_∞] ≤ ε, where V* is the optimal value function underlying the selected MDP, ‖·‖_∞ is the maximum-norm and the expectation is taken over the random function V̂.

Footnote 1. The bounds developed by Kearns et al. do not exhibit any dependence on the state space.
Footnote 2. The algorithm will be given in the next section.

AI Communications. IOS Press. All rights reserved.

The input of the algorithm is a tolerance, ε > 0. Given any ε > 0, the algorithm first builds up a (random) cache C_ε. Then, given a query state x ∈ X and the cache C_ε, the algorithm returns a sample of a random function V̂(x), V̂ being ε-optimal in the mean. Rust has shown that both phases of the algorithm are polynomial in |A|, K_r/(ε(1−γ)), L_p, L_r, d and K_p. Here L_r is the Lipschitz factor of the immediate rewards. Note that Rust's bound scales polynomially with the effective horizon-time, so our approach will be to extend his algorithm to planning.

The very first idea along this way is to make use of Markov's inequality. The algorithm based on this idea would work as follows. Fix the random sample and consider V̂ as given by Rust's algorithm, and a state x. Using Markov's inequality one gets that P(‖V̂ − V*‖_∞ ≥ δ) ≤ ε/δ. Now, imagine that we can compute

  argmax_{a∈A} { r(x, a) + γ ∫ p(y|x, a) V̂(y) dy }.

A contraction argument would then show that drawing N = poly(K_r/(εδ), L_p, L_r, |A|, d, K_r, K_p, 1/(1−γ)) samples is sufficient for ensuring the ε-optimality of the resulting greedy policy π with probability at least 1 − δ. Now, the |A| integrals can themselves be approximated by Monte-Carlo methods.^3 The computational complexity of the resulting algorithm will depend polynomially on N and will thus scale polynomially with L_r and 1/δ.

There are a number of methods to boost the polynomial dependence on 1/δ to log(1/δ). Here, we are going to use maximal inequalities to arrive at such a result. This method will have the additional benefit that we can get rid of the Lipschitz condition on the immediate rewards and boost the polynomial dependence of the complexity bounds on L_p to a poly-logarithmic one. Interestingly, our bound for the number of samples will be poly-logarithmic in the size of the action space, as well. Note, however, that the complexity bounds will still scale polynomially with the size of the action space. We will also derive novel bounds for the complexity of calculating uniformly optimal policies.

The organization of the paper is as follows: in Section 2 we provide the necessary background. The algorithm is given in Section 3, and the main results of the paper are formulated in Section 4. The proof of the main results is given in Section 5, and conclusions are drawn in Section 6.

Footnote 3. One might either want to reuse the samples drawn earlier or draw new samples. The second approach is easier to analyze, whilst the first one may appear more elegant to some.

2. Preliminaries

We assume that the reader is familiar with the basics of the theory of MDPs. Readers who lack the necessary background are referred to the book of Dynkin and Yushkevich [6] or the more recent books [2] and [11].

2.1. Notation

Let p ∈ [1, +∞]. ‖·‖_p refers to the l_p norm of vectors and the L_p norm of functions, depending on the type of its argument. Lip_p denotes the set of mappings that are Lipschitz-continuous in the norm ‖·‖_p: f ∈ Lip_p means that there exists a positive constant L > 0 such that ‖f(x) − f(y)‖_p ≤ L ‖x − y‖_p (domains of the mappings are suppressed). L is called the ‖·‖_p-Lipschitz constant of f. Lip_p(γ) ⊂ Lip_p denotes the set of mappings whose ‖·‖_p-Lipschitz constant is not larger than γ. A mapping T is called a contraction in the norm ‖·‖_p if T ∈ Lip_p(γ) for some 0 ≤ γ < 1. Let V be any set, T: V → V and S: V → V. Then the mapping TS: V → V is defined by (TS)v = T(Sv), v ∈ V. The set of natural numbers will be denoted by ℕ, the set of reals by ℝ. If t ∈ ℕ then T^t denotes the map that is the product of T with itself t times.
We say that T = S iff Tv = Sv holds for all v ∈ V. ω will in general denote an elementary event of the probability space under consideration, "lhs" means left-hand side, and "rhs" means right-hand side. We define B(X) to be the set of all bounded real-valued functions over X: B(X) = { f: X → ℝ : ‖f‖_∞ < +∞, f is measurable }. Further, for any K > 0, B_K(X) shall denote the set of all bounded functions whose maximum-norm is below the constant K: B_K(X) = { f ∈ B(X) : ‖f‖_∞ < K }.

Table 1
Pseudo-code of the algorithm

0. Input: x ∈ X (query state), ε > 0 (tolerance), p, r, γ, A (model parameters).
1. Compute t and N as defined in Theorem 4.2.
2. Draw X_1, ..., X_N, independent samples uniformly distributed over X.
3. Compute p̂_{X_{1:N}}(X_i | X_j, a) (1 ≤ i, j ≤ N) using p̂_{x_{1:N}}(x_i | x, a) = p(x_i | x, a) / Σ_{j=1}^N p(x_j | x, a) if Σ_{j=1}^N p(x_j | x, a) > 0, and let p̂_{x_{1:N}}(x_i | x, a) = 0 otherwise.
4. Let v_i = 0, 1 ≤ i ≤ N.
5. Repeat t times: v_i := max_{a∈A} { r(X_i, a) + γ Σ_{j=1}^N p̂_{X_{1:N}}(X_j | X_i, a) v_j }, 1 ≤ i ≤ N.
6. Let a* = argmax_{a∈A} { r(x, a) + γ Σ_{j=1}^N p̂_{X_{1:N}}(X_j | x, a) v_j }.
7. Return a*.

2.2. The Model

Let us consider the continuous space discounted MDP given by (X, A, p, r, γ), where X = [0,1]^d (d > 0, d ∈ ℕ) is the state space, A is the action space, p is a measurable transition density, i.e. p(x'|x, a) ≥ 0 and ∫ p(x'|x, a) dx' = 1 for all (x, a) ∈ X × A, r: X × A → ℝ is a measurable function called the reward function, and 0 ≤ γ < 1 is the discount factor. We further assume the following:

Assumption 2.1. A is finite.

Assumption 2.2. There exist constants K_p, L_p > 0 such that ‖p‖_∞ ≤ K_p and p(y|·, a) ∈ Lip_1(L_p) for all (y, a) ∈ X × A.

Assumption 2.3. There exists some constant K_r > 0 such that ‖r‖_∞ < K_r.

3. The Algorithm

The pseudo-code of the algorithm yielding uniformly approximately optimal policies can be seen in Table 1; a concrete sketch of it is given below. Note that, at the expense of increasing the computation time, one may downscale the storage requirement of the algorithm from O(N²) to O(N) if Step 3 of the algorithm is omitted; then Equation (2) must be used in Steps 5 and 6. Note that one may still precompute the normalizing factors of (2) to speed up the computations, since the storage requirement for these normalizing factors depends only linearly on N.

Rust's original algorithm builds up the cache C_ε = (v_1, ..., v_N) using Steps 1-5 with some N and t. Then, for any query state x ∈ X, his algorithm returns the random value V̂(x) = max_{a∈A} { r(x, a) + γ Σ_{j=1}^N p̂_{X_{1:N}}(X_j | x, a) v_j }. It can be readily seen that our algorithm is just a straightforward extension of the one considered by Rust; the difficulty lies in deriving appropriate bounds for N and t.

Now we introduce the notation needed to state the main results. Let T_a: B(X) → B(X) be defined by

  (T_a V)(x) = r(x, a) + γ ∫ p(y|x, a) V(y) dy.

Here a ∈ A is arbitrary and the integral should be understood, here and in what follows, to be over X. For a stationary policy π: X → A, let T_π: B(X) → B(X) be defined by (T_π V)(x) = (T_{π(x)} V)(x). Finally, let the Bellman operator T: B(X) → B(X) be defined by (T V)(x) = max_{a∈A} { (T_a V)(x) }.

Under our assumptions, T is known to have a unique fixed point, V*, called the optimal value function. V* is known to be uniformly bounded. It is also known that any (stationary) policy π: X → A satisfying T_π V* = T V* is optimal in the sense that for any given initial state the total expected discounted return resulting from the execution of π is maximal. (The execution of a policy π: X → A means the execution of action π(x) whenever the state is x.) A policy is called myopic or greedy w.r.t. the function V ∈ B(X) if T_π V = T V. Since in our case the action set A is finite, the existence of a myopic policy is guaranteed for any given uniformly bounded function V.
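Returning to the procedure in Table 1, the following sketch implements Steps 0-7 in Python for a user-supplied transition density and reward function. The function and parameter names (plan_action, density, reward, actions) are illustrative assumptions and not part of the paper, and the choice of t and N is left to the caller, who would compute them from Theorem 4.2.

```python
import numpy as np

def plan_action(x, actions, density, reward, gamma, t, N, rng=None):
    """Monte-Carlo planning step of Table 1 (illustrative sketch).

    density(x_next, x, a) -> p(x_next | x, a); reward(x, a) -> r(x, a).
    `x` is a point of [0,1]^d, `actions` a list of actions; t, N as in Theorem 4.2.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = len(x)
    # Step 2: N i.i.d. points, uniform on the state space [0,1]^d.
    X = rng.uniform(0.0, 1.0, size=(N, d))

    def p_hat(x_from, a):
        # Equation (2): transition weights normalized over the sample.
        w = np.array([density(X[i], x_from, a) for i in range(N)])
        s = w.sum()
        return w / s if s > 0 else np.zeros(N)

    # Step 3: precompute the normalized weight matrices and sampled rewards.
    P = {a: np.vstack([p_hat(X[j], a) for j in range(N)]) for a in actions}
    R = {a: np.array([reward(X[j], a) for j in range(N)]) for a in actions}

    # Steps 4-5: t rounds of value iteration on the sampled states.
    v = np.zeros(N)
    for _ in range(t):
        v = np.max([R[a] + gamma * P[a] @ v for a in actions], axis=0)

    # Steps 6-7: greedy action at the query state x.
    scores = [reward(x, a) + gamma * p_hat(x, a) @ v for a in actions]
    return actions[int(np.argmax(scores))]
```

Holding X, P, R and v fixed across queries gives the computationally cheaper variant discussed after Equation (4) below.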

Now let x_1, ..., x_N ∈ X be fixed elements of the state space. For brevity, let us denote the N-tuple (x_1, ..., x_N) by x_{1:N}. Let T̂_{x_{1:N},a}: B(X) → B(X) be defined by

  (T̂_{x_{1:N},a} V)(x) = r(x, a) + γ Σ_{i=1}^N p̂_{x_{1:N}}(x_i | x, a) V(x_i),   (1)

where

  p̂_{x_{1:N}}(x_i | x, a) = p(x_i | x, a) / Σ_{j=1}^N p(x_j | x, a) if Σ_{j=1}^N p(x_j | x, a) > 0, and p̂_{x_{1:N}}(x_i | x, a) = 0 otherwise.   (2)

The operator T̂_{x_{1:N},a} is obtained from T_a by approximating the integral in T_a by a finite sum. It should be clear that, because of the Lipschitz conditions on p, T̂_{x_{1:N},a} does approximate T_a, and the quality of approximation depends on the distribution of the points x_{1:N}. Using T̂_{x_{1:N},a} we introduce the operator T̂_{x_{1:N}} that is meant to approximate T. It is defined as follows: T̂_{x_{1:N}}: B(X) → B(X), and

  (T̂_{x_{1:N}} V)(x) = max_{a∈A} { (T̂_{x_{1:N},a} V)(x) }.   (3)

Now, analogously with the previous definitions, T̂_{x_{1:N},π} is introduced by (T̂_{x_{1:N},π} V)(x) = (T̂_{x_{1:N},π(x)} V)(x).

Throughout the paper we are going to work with independent random variables X_1, ..., X_N, uniformly distributed over X.^4 Similarly to the notation introduced for deterministic N-tuples of state space points, X_{1:N} will be used to denote (X_1, ..., X_N). We define the random operators T̂_a, T̂_π and T̂ by the respective equations T̂_a = T̂_{X_{1:N},a}, T̂_π = T̂_{X_{1:N},π}, and T̂ = T̂_{X_{1:N}}. Here T̂ is called the random Bellman operator. A great deal of effort in this paper will be devoted to showing that T̂ and its powers approximate the true Bellman operator T and its respective powers uniformly well, with high probability.

In order to connect the algorithm with the operators defined so far, let us introduce the projection operator P̂_{x_{1:N}}: B(X) → ℝ^N defined by P̂_{x_{1:N}} V = (V(x_1), ..., V(x_N)), and the expansion operators Ê_{x_{1:N},a}, Ê_{x_{1:N}}: ℝ^N → B(X) defined by the respective equations

  (Ê_{x_{1:N},a} v)(x) = r(x, a) + γ Σ_{j=1}^N p̂_{x_{1:N}}(x_j | x, a) v_j,  a ∈ A, and

  (Ê_{x_{1:N}} v)(x) = max_{a∈A} { (Ê_{x_{1:N},a} v)(x) }.

Finally, let the finite state-space Bellman operator L̂_{x_{1:N}}: ℝ^N → ℝ^N be defined by

  (L̂_{x_{1:N}} v)_i = max_{a∈A} { r(x_i, a) + γ Σ_{j=1}^N p̂_{x_{1:N}}(x_j | x_i, a) v_j }.

The following proposition highlights the connection between the algorithm and these operators:

Proposition 3.1. For any integer t > 0,

  T̂^{t+1}_{x_{1:N}} = Ê_{x_{1:N}} L̂^t_{x_{1:N}} P̂_{x_{1:N}},

and in particular,

  T̂^{t+1} = Ê_{X_{1:N}} L̂^t_{X_{1:N}} P̂_{X_{1:N}}.

Proof. By inspection.

Remark 3.2. According to Proposition 3.1, one can compute (T̂^{t+1} V)(x) in two phases, the first of which we could call the off-line phase and the second of which we could call the on-line phase. In the off-line phase one computes the N-dimensional vector v^{(t)} = L̂^t_{X_{1:N}} P̂_{X_{1:N}} V, which takes O(t N² |A|) time, whilst in the second phase one computes the value of (T̂^{t+1} V)(x) by evaluating (T̂^{t+1} V)(x) = (Ê_{X_{1:N}} v^{(t)})(x). This second step takes O(N² |A|) time, and thus the whole procedure takes O(t N² |A|) time. Further, it is easy to see that the procedure takes O(N + |A|) space.^5

Footnote 4. The uniform distribution is used for simplicity only. Any other sampling distribution with support covering X could be used if the algorithm is modified appropriately (importance sampling) [7]. The form of the ideal sampling distribution is far from being clear, since a single sample set is used to estimate an infinite number of integrals. The form of the ideal distribution should be the subject of future research.
Footnote 5. Here we assume that the basic algebraic operations over reals take O(1) time and that the storage of a real number takes O(1) space. We also assume that p̂_{X_{1:N}} is not stored.

Now, the algorithm whose pseudocode was given above can be formulated as follows. Assume that we are given a fixed tolerance ε > 0. On the basis of ε and L_p, K_r, |A|, γ we compute some integer t > 0 and another integer N > 0. Each time we need to compute an action of the randomized policy π for some state x, we draw a random sample X_{1:N} and compute v^{(t)} = L̂^t_{X_{1:N}} P̂_{X_{1:N}} V_0, where V_0(x) = 0. Then a random action of π(x) is computed by evaluating

  argmax_{a∈A} (Ê_{X_{1:N},a} v^{(t)})(x).   (4)

The action returned by the argmax operator is the output. The resulting policy will be shown to be ε-optimal. Another, computationally less expensive method is to hold the random sample X_{1:N} fixed and compute v^{(t)} only once. Then the computation of π(x) using (4) costs only O(|A| N²) steps.

4. Results

The first result that we prove shows that the algorithm just described at the end of the previous section yields uniformly approximately optimal policies with high probability and has polynomial complexity:

Theorem 4.1. Let K = K_r/(1−γ) and let ε > 0, δ > 0, V_0 ∈ B_K(X) be fixed. Let t = t(ε, γ, K), where

  t(ε, γ, K) = ⌈ ( log(8K) + log(1/(ε(1−γ))) ) / log(1/γ) ⌉,

and let

  N ≥ 512 K² K_p² ( 24(K+1)/(ε(1−γ)²) )² ( log 8 + log(t(ε, γ, K)+1) + log|A| + d log( 384(K+1)² L_p d/(ε(1−γ)²) + 1 ) + log(1/δ) ).

Let V = T̂^t V_0 and let the stationary policy π be defined by T̂_π V = T̂ V. Then

  P( ‖V^π − V*‖_∞ > ε ) ≤ δ.

Further, the complexity of the algorithm is polynomial in d, 1/ε, K, K_p, log(L_p), |A| and 1/(1−γ).

Note that ideally the bound on N should depend only on K/ε, so that scaling the rewards would not change the complexity results. The bound given in the above theorem does not have this property: it has some K's without a corresponding ε. The cause of this will become clear during the course of the proof of this theorem and, more specifically, in the proof of Lemma 5.7. Note that if ε is sufficiently small, an upper bound on the above expression can always be derived by replacing K by K/ε at those occurrences of K that lack a corresponding ε term. In this way one gets a less tight but, in some sense, better behaving bound.

The next result shows that the modified, fully on-line algorithm given in Table 1 yields a uniformly approximately optimal policy and has polynomial complexity. The above comments on scaling the rewards apply to this result, too.

Theorem 4.2. Let K = K_r/(1−γ) and let ε > 0. Fix some V_0 ∈ B_K(X). Let ε_1 = ε(1−γ)/(2(1+γ)) and δ_1 = ε(1−γ)/(4K) (= ε(1−γ)²/(4K_r)). Further, let t = t(ε_1, γ, K) and let N be the smallest integer larger than

  512 K² K_p² ( 96(K+1)/(ε(1−γ)³) )² ( log 8 + log(t+1) + log|A| + d log( 768(K+1)² L_p d/(ε(1−γ)³) + 1 ) + log( 4K/(ε(1−γ)) ) ).

Let V = T̂^t V_0 and let the stochastic stationary policy π: X × A → [0,1] be defined by π(x, a) = P(π_{X_{1:N}}(x) = a), where π_{X_{1:N}} is the policy defined by T̂_{π_{X_{1:N}}} V = T̂ V. Then π is ε-optimal and, given a state x, a random action of π can be computed in time and space polynomial in K_r/ε, d, K_r, K_p, log L_p, |A| and 1/(1−γ).
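Before moving to the proofs, here is a small hypothetical helper that evaluates t(ε, γ, K) and the sample-size bound N of Theorem 4.1 exactly as reconstructed above. The function and argument names are my own, and the formulas are only as reliable as the reconstruction of the theorem statement from the text.

```python
import math

def horizon_t(eps, gamma, K):
    """t(eps, gamma, K) from Theorem 4.1 (ceiling of a log ratio)."""
    return math.ceil((math.log(8 * K) + math.log(1.0 / (eps * (1 - gamma))))
                     / math.log(1.0 / gamma))

def sample_size(eps, delta, gamma, d, K_r, K_p, L_p, n_actions):
    """Sample-size bound N of Theorem 4.1, as reconstructed in the text."""
    K = K_r / (1 - gamma)
    t = horizon_t(eps, gamma, K)
    lead = 512 * K**2 * K_p**2 * (24 * (K + 1) / (eps * (1 - gamma)**2))**2
    logs = (math.log(8) + math.log(t + 1) + math.log(n_actions)
            + d * math.log(384 * (K + 1)**2 * L_p * d / (eps * (1 - gamma)**2) + 1)
            + math.log(1.0 / delta))
    return math.ceil(lead * logs)

# Example: even modest constants give a very large N; the point of the bound is
# its shape (polynomial in d, 1/eps, 1/(1-gamma), log L_p, |A|), not its size.
print(horizon_t(0.1, 0.9, 10.0), sample_size(0.1, 0.05, 0.9, 2, 1.0, 2.0, 1.0, 4))
```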

The rough outlines of the proofs of these theorems are as follows. Under our assumptions, Pollard's maximal inequality (cf. [10]) ensures that for any given fixed function V_0, ‖T̂ V_0 − T V_0‖_∞ is small with high probability.^6 Using the triangle inequality, one reduces the comparison of T̂^n V_0 and T^n V_0 to those of T̂ T^k V_0 and T T^k V_0, where k varies from zero to n − 1. More precisely, one shows that if the differences between T̂ T^k V_0 and T T^k V_0 are small for all k = 0, ..., n − 1, then ‖T̂^n V_0 − T^n V_0‖_∞ will be small, too. Using this result, it is then easy to prove a maximal inequality for ‖T̂^n V_0 − T^n V_0‖_∞. Now, one can use standard contraction arguments to prove an inequality that bounds the difference between the value of a policy that is approximately greedy w.r.t. some function V and the optimal value function, in terms of the Bellman residuals (see e.g. [14]). The plan is to use this inequality with V = T̂^n V_0 and T̂. Some more calculations yield Theorem 4.1.

Then, it is proven that if a policy selects only good actions (i.e., actions from A_ε(x) = { a ∈ A : (T_a V*)(x) ≥ (T V*)(x) − ε } for a suitable ε) then it is good itself (i.e., close to optimal). Next, we relax the condition of selecting good actions to selecting good actions with high probability. Such policies can be shown to be good, as well (cf. Lemma 5 of [9]). Finally, it is shown that if a policy is good with high probability then it selects good actions with high probability and thus, in turn, it must be good. This will finish the proof of Theorem 4.2.

One source of the complexity of the proof stems from the fact that Pollard's inequality cannot be used in a simple way to bound ‖T̂^n V_0 − T^n V_0‖_∞. This is because the usual induction argument that would bound ‖T̂^n V_0 − T^n V_0‖_∞ based on a bound on ‖T̂^{n−1} V_0 − T^{n−1} V_0‖_∞ does not quite work here. Typically, one argues that if T̂ approximates T uniformly well over the space of bounded functions (or some space of functions of interest) then ‖T̂^n V_0 − T^n V_0‖_∞ will be small if ‖T̂^{n−1} V_0 − T^{n−1} V_0‖_∞ is small. Unfortunately, the space of all bounded functions is just too rich in our case: T̂ cannot approximate T uniformly well over this rather complex space. A smaller, but still appropriate, space F is needed; hence the complicated proof.

Footnote 6. We must rely on Pollard's maximal inequality instead of the simpler Chernoff bounds because the state space is continuous and the sup-norm above involves a supremum over the state space. Further, this result is derived in two steps, using an idea of Rust [13].

5. Proof

We prove the theorems in the next three subsections. First, we prove some maximal inequalities for the random Bellman operators T̂_a. Next we show how these can be extended to powers of T̂ and, finally, we apply all these to prove the main results.

5.1. Maximal Inequalities for Random Bellman Operators

We shall need some auxiliary operators which are easier to deal with using probability theory. Let T̃_a: B(X) → B(X) and T̃: B(X) → B(X) be defined by

  (T̃_a V)(x) = r(x, a) + γ (1/N) Σ_{i=1}^N p(X_i | x, a) V(X_i),

  (T̃ V)(x) = max_{a∈A} { (T̃_a V)(x) }.

Operator T̃_a is a simple Monte-Carlo estimate of operator T_a and will be shown to converge uniformly to T_a using standard methods. Unfortunately, T̃_a is not suitable for further analysis as it can be a non-contraction, and in order to analyze the iterations in our algorithms the contraction property of the approximate Bellman operators will be needed. Hence the algorithms use T̂_a; in a second step T̃_a will be related to T̂_a, and the approximation results will be extended to T̂_a.
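The distinction between the two sample-based backups is easy to see in code. The sketch below is my own illustration with made-up function names: it applies both the plain Monte-Carlo backup T̃_a and the normalized backup T̂_a of Equation (1) to a function V at a point x.

```python
import numpy as np

def backup_tilde(x, a, V, X, density, reward, gamma):
    """(T~_a V)(x): plain Monte-Carlo estimate of the integral; may overshoot."""
    w = np.array([density(Xi, x, a) for Xi in X]) / len(X)   # weights need not sum to 1
    return reward(x, a) + gamma * w @ np.array([V(Xi) for Xi in X])

def backup_hat(x, a, V, X, density, reward, gamma):
    """(T^_a V)(x) from Eqs. (1)-(2): weights normalized over the sample."""
    w = np.array([density(Xi, x, a) for Xi in X])
    s = w.sum()
    w = w / s if s > 0 else np.zeros_like(w)                 # convex combination (or zero)
    return reward(x, a) + gamma * w @ np.array([V(Xi) for Xi in X])
```

Because the weights in backup_hat form a convex combination (or vanish), |(T̂_a V)(x) − (T̂_a W)(x)| ≤ γ‖V − W‖_∞, which is the contraction property used throughout Section 5; the unnormalized weights of backup_tilde can sum to more than one, so T̃ need not contract.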
We need some definitions and results from the theory of uniform deviations (cf. [10]).

Definition 5.1. Let A ⊂ ℝ^d. The set S ⊂ A is an ε-cover of A if for all t ∈ A there exists an element s of S such that (1/d) Σ_{i=1}^d |t_i − s_i| ≤ ε. The set of ε-covers of A will be denoted C(A; ε).

Definition 5.2. The ε-covering number of a set A is defined by

  N(ε, A) = min{ |S| : S ∈ C(A; ε) }.

The number log N(ε, A) is called the metric entropy of A. Let z_{1:n} = (z_1, ..., z_n) ∈ (ℝ^d)^n and let F ⊂ ℝ^{ℝ^d}. We define

  F(z_{1:n}) = { (f(z_1), ..., f(z_n)) : f ∈ F } ⊂ ℝ^n.   (5)

The following theorem is due to Pollard (see [10]):

Theorem 5.3 (Pollard, 1984). Let n > 0 be an integer, ε > 0, M > 0, and let F ⊂ [0, M]^{ℝ^d} be a set of measurable functions. Let X_1, ..., X_n ∈ ℝ^d be i.i.d. random variables. Then

  P( sup_{f∈F} | (1/n) Σ_{i=1}^n f(X_i) − E[f(X_1)] | > ε ) ≤ 8 E[ N(ε/8, F(X_{1:n})) ] e^{−nε²/(128M²)}.   (6)

An elegant proof of this theorem can be found in [5, pp. 492]. In general, some further assumptions are needed to make the result of the above supremum measurable. Measurability problems, however, are now well understood, so we shall not worry about this detail. Readers who keep worrying should take all the probability bounds, except for the main result, as outer/inner-probability bounds (whichever is appropriate). Note that in the final result we work with measurable sets and therefore there is no need to refer to outer/inner probability measures.

Firstly, we extend this theorem to functions mapping ℝ^d into [−M, M].

Corollary 5.3.1. Let n > 0 be an integer, ε > 0, M > 0, and let F ⊂ [−M, M]^{ℝ^d} be a set of measurable functions. Let X_1, ..., X_n ∈ ℝ^d be i.i.d. random variables. Then

  P( sup_{f∈F} | (1/n) Σ_{i=1}^n f(X_i) − E[f(X_1)] | > ε ) ≤ 8 E[ N(ε/8, F(X_{1:n})) ] e^{−nε²/(512M²)}.   (7)

Proof. Apply Theorem 5.3 to f_M = f + M.

Definition 5.4. Let d > 0 be an integer and let σ > 0. Let

  Grid(σ) = { (2 i_1 σ, ..., 2 i_d σ) ∈ [0,1]^d : 0 ≤ i_k, 1 ≤ k ≤ d, i_k ≤ 1/(2σ) }

and let P_σ: [0,1]^d → Grid(σ) be defined by

  P_σ x = argmin_y { ‖x − y‖_1 : y ∈ Grid(σ) },

where ties are broken in favor of points having smaller coordinates.

Remark 5.5. ‖x − P_σ x‖_1 ≤ d σ and |Grid(σ)| ≤ (1/(2σ) + 1)^d.
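For concreteness, here is a small sketch (with hypothetical helper names of my own) of Grid(σ) and of the projection P_σ of Definition 5.4; it also checks the two facts recorded in Remark 5.5 numerically for a random point.

```python
import itertools
import numpy as np

def grid(sigma, d):
    """Grid(sigma): points with coordinates 2*i*sigma inside [0,1]^d."""
    axis = np.arange(0.0, 1.0 + 1e-12, 2.0 * sigma)     # 0, 2*sigma, 4*sigma, ...
    return np.array(list(itertools.product(axis, repeat=d)))

def project(x, sigma):
    """P_sigma x: closest grid point in l1 distance (ties -> smaller coordinates)."""
    G = grid(sigma, len(x))
    dists = np.abs(G - x).sum(axis=1)
    return G[np.argmin(dists)]          # argmin returns the first (lexicographically smallest) minimizer

sigma, d = 0.1, 2
x = np.random.rand(d)
assert np.abs(x - project(x, sigma)).sum() <= d * sigma          # Remark 5.5: l1 distance at most d*sigma
assert len(grid(sigma, d)) <= (1.0 / (2 * sigma) + 1) ** d       # Remark 5.5: size of the grid
```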

Now we can prove our first result concerning the approximation of T_a by T̃_a.

Lemma 5.6. Let K > 0, ε > 0 and δ > 0. Further, let B_0 ⊂ B_K(X) be a finite set, let

  p_1(d, ε, δ, K, K_p, L_p, |B_0|, |A|, γ) = (512 K² K_p²/ε²) ( log 8 + log|B_0| + log|A| + d log( 16 K L_p d/ε + 1 ) + log(1/δ) )   (8)

and let N ≥ p_1(d, ε, δ, K, K_p, L_p, |B_0|, |A|, γ). Then

  P( max_{V∈B_0} max_{a∈A} ‖T̃_a V − T_a V‖_∞ > ε ) ≤ δ.   (9)

Proof. We shall make use of Corollary 5.3.1. Let F(X_{1:N}) = { z(V, x, a) : V ∈ B_0, a ∈ A, x ∈ X }, where z(V, x, a) = (V(X_1) p(X_1 | x, a), ..., V(X_N) p(X_N | x, a)). Easily, the components of z(V, x, a) lie in [−K K_p, K K_p]. In order to bound N(ε, F(X_{1:N})) from above, we construct an ε-cover of F(X_{1:N}). We claim that S_σ = { z(V, x, a) : V ∈ B_0, a ∈ A, x ∈ Grid(σ) } is an ε-cover of F(X_{1:N}) if σ is chosen appropriately. In order to prove this, let us pick an arbitrary element z(V, x, a) of F(X_{1:N}). Then

  (1/N) Σ_{i=1}^N |V(X_i) p(X_i | x, a) − V(X_i) p(X_i | P_σ x, a)| ≤ ‖V‖_∞ (1/N) Σ_{i=1}^N |p(X_i | x, a) − p(X_i | P_σ x, a)| ≤ K L_p d σ.

Therefore, if σ = ε/(K L_p d) then S_σ is an ε-cover of F(X_{1:N}). By Remark 5.5, N(ε, F(X_{1:N})) ≤ (1/(2σ) + 1)^d |B_0| |A|. By Corollary 5.3.1, if

  N ≥ (512 K² K_p²/ε²) ( log 8 + log|B_0| + log|A| + d log( 16 K L_p d/ε + 1 ) + log(1/δ) )

then (9) holds.

Now, we shall prove a similar result for T̂_a, using ideas from the proof of the Corollary to Theorem 3.4 of [13].

Lemma 5.7. Let K > 0, ε > 0 and δ > 0. Further, let B_0 ⊂ B_K(X) be a finite set and let

  p_2(d, ε, δ, K, K_p, L_p, |B_0|, |A|, γ) = 512 K² K_p² ( (K+1)/ε )² ( log 8 + log(|B_0|+1) + log|A| + d log( 16(K+1)² L_p d/ε + 1 ) + log(1/δ) ).   (10)

If N ≥ p_2(d, ε, δ, K, K_p, L_p, |B_0|, |A|, γ) then

  P( max_{V∈B_0} max_{a∈A} ‖T̂_a V − T_a V‖_∞ > ε ) ≤ δ.   (11)

Proof. Let us pick some V ∈ B_0. By the triangle inequality,

  ‖T̂_a V − T_a V‖_∞ ≤ ‖T̂_a V − T̃_a V‖_∞ + ‖T̃_a V − T_a V‖_∞.   (12)

Let

  p_N(x, a) = (1/N) Σ_{i=1}^N p(X_i | x, a).

If p_N(x, a) = 0 then (T̂_a V)(x) − (T̃_a V)(x) = 0. If p_N(x, a) ≠ 0 then by simple algebraic manipulations we get

  (T̂_a V)(x) − (T̃_a V)(x) = γ ((1 − p_N(x, a))/p_N(x, a)) (1/N) Σ_{i=1}^N p(X_i | x, a) V(X_i).

Since, by assumption, |V(X_i)| ≤ K, we have

  |(T̂_a V)(x) − (T̃_a V)(x)| ≤ γ K |1 − p_N(x, a)|.   (13)

Let e: X → ℝ be defined by e(x) = 1 and observe that γ(1 − p_N(x, a)) = (T_a e)(x) − (T̃_a e)(x); therefore, by (13) we have

  |(T̂_a V)(x) − (T̃_a V)(x)| ≤ K |(T_a e)(x) − (T̃_a e)(x)|.

Note that this inequality also holds when p_N(x, a) = 0. Taking the supremum over X yields ‖T̂_a V − T̃_a V‖_∞ ≤ K ‖T_a e − T̃_a e‖_∞. By (12) we have

  ‖T̂_a V − T_a V‖_∞ ≤ K ‖T_a e − T̃_a e‖_∞ + ‖T̃_a V − T_a V‖_∞ ≤ (K + 1) max_{V∈B_0∪{e}} ‖T̃_a V − T_a V‖_∞.

Therefore

  max_{V∈B_0} max_{a∈A} ‖T̂_a V − T_a V‖_∞ ≤ (K + 1) max_{V∈B_0∪{e}} max_{a∈A} ‖T̃_a V − T_a V‖_∞.

Now, the statement of the lemma follows using Lemma 5.6 with the choice p_1(d, ε/(K+1), δ, K+1, K_p, L_p, |B_0|+1, |A|, γ).

5.2. Maximal Inequalities for Powers of Random Bellman Operators

First we need a proposition that relates the fixed point of a contraction operator to an operator that approximates the contraction.

Proposition 5.8. Let B be a space of bounded functions,^7 and fix some V ∈ B and integer t > 0. Let T_1, T_2: B → B be operators on B such that T_1 ∈ Lip_∞(γ) for some 0 ≤ γ < 1 and

  ‖T_1 T_2^s V − T_2 T_2^s V‖_∞ ≤ α,  0 ≤ s ≤ t − 1,   (14)

for some α > 0. Then

  ‖T_1^t V − T_2^t V‖_∞ ≤ α/(1 − γ).   (15)

Proof. We prove the statement by induction; namely, we prove that

  ‖T_1^s V − T_2^s V‖_∞ ≤ α/(1 − γ)   (16)

holds for all 0 ≤ s ≤ t. The statement is obvious for s = 0. Assume that we have already proven (16) for s − 1. By the triangle inequality,

  ‖T_1^s V − T_2^s V‖_∞ ≤ ‖T_1 T_1^{s−1} V − T_1 T_2^{s−1} V‖_∞ + ‖T_1 T_2^{s−1} V − T_2 T_2^{s−1} V‖_∞.

Since T_1 ∈ Lip_∞(γ), the first term of the rhs can be bounded by γ ‖T_1^{s−1} V − T_2^{s−1} V‖_∞, which in turn can be bounded by γα/(1 − γ) by the induction hypothesis. The second term, on the other hand, can be bounded by α, by (14). Since γα/(1 − γ) + α = α/(1 − γ), inequality (16) holds for s as well, thus proving the proposition.

We cite the next proposition without proof, as the proof is both elementary and well known.

Proposition 5.9. Let K = K_r/(1 − γ). Then the Bellman operator T maps B_K(X) into B_K(X).

Now follows the main result of this section.

Lemma 5.10. Let t > 0 be an integer, ε > 0, δ > 0, K = K_r/(1 − γ), V_0 ∈ B_K(X), and let B_0 = {V_0, T V_0, ..., T^t V_0}. Let

  p_3(d, ε, δ, K, K_p, L_p, |B_0|, |A|, γ) = 512 K² K_p² ( (K+1)/(ε(1−γ)) )² ( log 8 + log(|B_0|+1) + log|A| + d log( 16(K+1)² L_p d/(ε(1−γ)) + 1 ) + log(1/δ) ).

If N ≥ p_3(d, ε, δ, K, K_p, L_p, |B_0|, |A|, γ) then

  P( max{ max_{a∈A} ‖T̂_a T^t V_0 − T_a T^t V_0‖_∞, ‖T̂^t V_0 − T^t V_0‖_∞ } > ε ) ≤ δ.   (17)

Proof. Let V_s = T^s V_0, 0 ≤ s ≤ t, so that B_0 = {V_0, V_1, ..., V_t}. By Proposition 5.9, B_0 ⊂ B_K(X). By Lemma 5.7, if N ≥ p_2(d, ε(1−γ), δ, K, K_p, L_p, |B_0|, |A|, γ) then

  P( max_{V∈B_0} max_{a∈A} ‖T̂_a V − T_a V‖_∞ > ε(1−γ) ) ≤ δ.

Let the elementary random event ω be such that

  max_{V∈B_0} max_{a∈A} ‖T̂_a(ω) V − T_a V‖_∞ ≤ ε(1−γ).

If we show that

  max( max_{a∈A} ‖T̂_a(ω) T^t V_0 − T_a T^t V_0‖_∞, ‖T̂(ω)^t V_0 − T^t V_0‖_∞ ) ≤ ε   (18)

then the proof will be finished. Obviously,

  max_{a∈A} ‖T̂_a(ω) T^t V_0 − T_a T^t V_0‖_∞ ≤ ε   (19)

by the construction of B_0 and since 1 − γ ≤ 1.

Footnote 7. More generally, B could be any Banach space.

10 0 ˆT ω)v T V max ˆT a ω)v T a V holds for all V BX ). Since by the choice of ω and, max ˆT a ω)t s V 0 T a T s V 0 ε γ), 0 s t, we also have ˆT ω)t s V 0 T T s V 0 ε γ), 0 s t. Moreover, since ˆT ω) Lip γ), Proposition 5.8 can be applied with the choice B = BX ), T = ˆT ω), T 2 = T and V = V 0, yielding ˆT ω) t V 0 T t V 0 ε. This together with 9) yields 8), thus proving the theorem Proving the ε-optimality of the Algorithm First, we prove an inequality similar to that of [4], but here we use both approximate value functions and approximate operators. Lemma 5.. Let V BX ), x : X for some > 0 and let π : X A be such that Then V π V 2 γ ˆT x : πv = ˆT x : V. max T a V ˆT x : av + γ T V V ). 20) ote that since A is finite, the policy defined in the lemma exists. Proof. We compare Tπ k V and T k V since these are known to converge to V π and V, respectively. Firstly, we write the difference Tπ k V T k V in the form of a telescoping sum: k Tπ k V T k V = T i+ π V TπV i ) i= k + T π V T V ) T i+ V T i V ). i= Using the triangle inequality, the relations T, T π Lip γ), and the inequality γ k + γ k γ γ/ γ), we get T k π V T k V γ T π V V γ + T V V ) + T π V T V. Using the identity ˆT x : πv = ˆT x : V, we write T π V T V = and thus 2) T π V ˆT ) ) x : πv + ˆTx : V T V T π V T V T π V ˆT x : πv + ˆT x : V T V 2 max T a V ˆT x : av. 22) On the other hand, T π V V T π V T V + T V V, and therefore by 2), T k π V T k V 2γ γ T V V ) γ + γ + T π V T V which combined with 22) yields T k π V T k V 2 γ T V V γ + max T a V ˆT x : av ). Taking the limes superior of both sides when k yields the lemma.

Note that if T̂_{x_{1:N},a} = T_a then we recover the tight bounds of [14].^8 The next lemma exploits the fact that if V_t = T̂_{x_{1:N}}^t V_0 for some V_0 ∈ B_K(X), then the Bellman error ‖T V_t − V_t‖_∞ can be related to the quality of the approximation of T_a by T̂_{x_{1:N},a}.

Footnote 8. Note that the lemma still holds if we replace the special operators T̂_{x_{1:N},a}, T̂_{x_{1:N},π} and T̂_{x_{1:N}} by operators T̂_a, T̂_π, T̂ ∈ Lip_∞(γ) satisfying (T̂_π V)(x) = (T̂_{π(x)} V)(x) and (T̂ V)(x) = max_{a∈A} (T̂_a V)(x).

Lemma 5.12. Let K = K_r/(1−γ), ε > 0 and let V_0 ∈ B_K(X) be fixed. Let

  t = t(ε, γ, K) = ⌈ ( log(8K) + log(1/(ε(1−γ))) ) / log(1/γ) ⌉,

let x_{1:N} ∈ X^N, V_t = T̂_{x_{1:N}}^t V_0, and assume that

  max_{a∈A} ‖T_a V_t − T̂_{x_{1:N},a} V_t‖_∞ ≤ ε(1−γ)/(4(1+γ)).   (23)

Further, let π: X → A be such that T̂_{x_{1:N},π} V_t = T̂_{x_{1:N}} V_t. Then π is ε-optimal, i.e., ‖V^π − V*‖_∞ ≤ ε.

Proof. We use Lemma 5.11 with V = V_t and bound the Bellman error ‖T V_t − V_t‖_∞ first:

  ‖T V_t − V_t‖_∞ ≤ ‖T V_t − T̂_{x_{1:N}} V_t‖_∞ + ‖T̂_{x_{1:N}} V_t − V_t‖_∞ ≤ max_{a∈A} ‖T_a V_t − T̂_{x_{1:N},a} V_t‖_∞ + ‖T̂_{x_{1:N}}^{t+1} V_0 − T̂_{x_{1:N}}^t V_0‖_∞.

Since T̂_{x_{1:N}} ∈ Lip_∞(γ), the second term is bounded by

  γ^t ‖T̂_{x_{1:N}} V_0 − V_0‖_∞ ≤ γ^t ( ‖T̂_{x_{1:N}} V_0‖_∞ + ‖V_0‖_∞ ) ≤ 2K γ^t,

where we have used that T̂_{x_{1:N}}: B_K(X) → B_K(X) and V_0 ∈ B_K(X). Therefore, by Lemma 5.11 we have

  ‖V^π − V*‖_∞ ≤ (2(1+γ)/(1−γ)) max_{a∈A} ‖T_a V_t − T̂_{x_{1:N},a} V_t‖_∞ + 4K γ^{t+1}/(1−γ).

Using the definition of t and (23) we get ‖V^π − V*‖_∞ ≤ ε, proving the lemma.

Now we are in the position to prove the first main result, stated before as Theorem 4.1:

Theorem 5.13. Let K = K_r/(1−γ) and let ε > 0, δ > 0, V_0 ∈ B_K(X) be fixed. Let t = t(ε, γ, K) and let

  p_4(d, ε, δ, K, K_p, L_p, |A|, γ) = 512 K² K_p² ( 24(K+1)/(ε(1−γ)²) )² ( log 8 + log(t(ε, γ, K)+1) + log|A| + d log( 384(K+1)² L_p d/(ε(1−γ)²) + 1 ) + log(1/δ) ).   (24)

Let N ≥ p_4(d, ε, δ, K, K_p, L_p, |A|, γ). Let V = T̂^t V_0 and let the stationary policy π be defined by T̂_π V = T̂ V. Then

  P( ‖V^π − V*‖_∞ > ε ) ≤ δ.   (25)

Proof. The proof combines Lemmas 5.12 and 5.10. Firstly, we bound

  m = max_{a∈A} ‖T̂_a T̂^t V_0 − T_a T̂^t V_0‖_∞.

Let π: X → A be defined by

  π(x) = argmax_{a∈A} ‖T̂_a T̂^t V_0 − T_a T̂^t V_0‖_∞

(π does not depend on x). Then

  m = ‖T̂_π T̂^t V_0 − T_π T̂^t V_0‖_∞
    ≤ ‖T̂_π T̂^t V_0 − T̂_π T^t V_0‖_∞ + ‖T̂_π T^t V_0 − T_π T^t V_0‖_∞ + ‖T_π T^t V_0 − T_π T̂^t V_0‖_∞
    ≤ 2γ ‖T̂^t V_0 − T^t V_0‖_∞ + max_{a∈A} ‖T̂_a T^t V_0 − T_a T^t V_0‖_∞
    ≤ (2γ + 1) max{ max_{a∈A} ‖T̂_a T^t V_0 − T_a T^t V_0‖_∞, ‖T̂^t V_0 − T^t V_0‖_∞ }.

Therefore, if

  N ≥ p_3(d, ε(1−γ)/(4(2γ+1)(γ+1)), δ, K, K_p, L_p, t(ε, γ, K), |A|, γ),

then by Lemma 5.10 and Lemma 5.12, ‖V^π − V*‖_∞ ≤ ε with probability at least 1 − δ.

In order to finish the proof of the main theorem we will prove that, in discounted problems, stochastic policies that generate ε-optimal actions with high probability are uniformly good. This result appears in the context of finite models in [9]. For completeness, we present the proof here. We start with the definition of ε-optimal actions and then prove three simple lemmas.

Definition 5.14. Let ε > 0, and consider a discounted MDP (X, A, p, r, γ). We call the set

  A_ε(x) = { a ∈ A : (T_a V*)(x) ≥ (T V*)(x) − ε }

the set of ε-optimal actions. Elements of this set are called ε-optimal.

Lemma 5.15. Let π: X × A → [0,1] be a stationary stochastic policy that selects only ε-optimal actions: for all x ∈ X and a ∈ A, π(x, a) > 0 implies a ∈ A_ε(x). Then ‖V^π − V*‖_∞ ≤ ε/(1−γ).

Proof. From the definition of π it is immediate that T_π V* ≥ V* − ε. Clearly, T_π V* ≤ V*, and

  (T_π V*)(x) = Σ_{a∈A} π(x, a) (T_a V*)(x) = Σ_{a∈A_ε(x)} π(x, a) (T_a V*)(x) ≥ Σ_{a∈A_ε(x)} π(x, a) ( (T V*)(x) − ε ) = V*(x) − ε.

Now, consider the telescoping sum

  T_π^k V* = T_π V* + Σ_{i=1}^{k−1} ( T_π^{i+1} V* − T_π^i V* ).

Therefore,

  ‖T_π^k V* − V*‖_∞ ≤ ‖T_π V* − V*‖_∞ + Σ_{i=1}^{k−1} ‖T_π^{i+1} V* − T_π^i V*‖_∞ ≤ ε + (γ/(1−γ)) ε = ε/(1−γ).

Letting k → ∞ gives the statement.

The next lemma will be applied to show that if two policies are close to each other then so are their evaluation functions. Both the lemma and its proof are very similar to those of Proposition 5.8.

Lemma 5.16. Let B be a space of bounded functions,^9 and let B_K = { V ∈ B : ‖V‖_∞ ≤ K }. Assume that T_1, T_2: B_K → B_K are such that for some α > 0, ‖T_1 V − T_2 V‖_∞ ≤ α holds for all V ∈ B_K, and T_1 ∈ Lip_∞(γ) for some 0 ≤ γ < 1. Then ‖T_1^s V − T_2^s V‖_∞ ≤ α/(1−γ). Further, let V_1* be the fixed point of T_1 and V_2* be the fixed point of T_2. If T_2 ∈ Lip_∞(γ) then ‖V_1* − V_2*‖_∞ ≤ α/(1−γ).

Footnote 9. Again, B could be any Banach space.

Proof. The proof is almost identical to that of Proposition 5.8. One proves by induction that ‖T_1^s V − T_2^s V‖_∞ ≤ α/(1−γ) holds for all s ≥ 0. Here V ∈ B_K is fixed. Indeed, the inequality holds for s = 0. Assuming that it holds for s − 1 with s ≥ 1, one gets

  ‖T_1^s V − T_2^s V‖_∞ ≤ ‖T_1 T_1^{s−1} V − T_1 T_2^{s−1} V‖_∞ + ‖T_1 T_2^{s−1} V − T_2 T_2^{s−1} V‖_∞ ≤ γα/(1−γ) + α = α/(1−γ),

showing the first part of the statement. The second part is proven by taking the limes superior of both sides as s → ∞.

Now we are ready to prove the lemma showing that policies that choose ε-optimal actions with high probability are uniformly good.

Lemma 5.17. Let ε > 0 and 1 > δ > 0 be given. Let π: X × A → [0,1] be a stochastic policy that selects ε-optimal actions with probability at least 1 − δ. Then ‖V^π − V*‖_∞ ≤ (ε + 2Kδ)/(1−γ).

Proof. Let δ(x) = Σ_{a∉A_ε(x)} π(x, a) denote the probability of selecting non-ε-optimal actions in state x (x ∈ X). By assumption, δ(x) ≤ δ < 1. Let π′: X × A → [0,1] be the policy defined by

  π′(x, a) = π(x, a)/(1 − δ(x)) if a ∈ A_ε(x), and π′(x, a) = 0 otherwise.

We claim that T_π and T_{π′} are close to each other. For, let V ∈ B_K(X), where K = K_r/(1−γ). Then

  (T_π V)(x) − (T_{π′} V)(x) = Σ_{a∈A} ( π(x, a) − π′(x, a) ) (T_a V)(x),

and since ‖T_a V‖_∞ ≤ K,

  ‖T_π V − T_{π′} V‖_∞ ≤ K sup_x Σ_{a∈A} |π(x, a) − π′(x, a)|.

Further,

  Σ_{a∈A} |π(x, a) − π′(x, a)| = Σ_{a∈A_ε(x)} | π(x, a) − π(x, a)/(1 − δ(x)) | + Σ_{a∉A_ε(x)} π(x, a) = 2δ(x) ≤ 2δ.

Therefore, ‖T_π V − T_{π′} V‖_∞ ≤ 2Kδ. Since T_π, T_{π′} and B_K(X) satisfy the assumptions of Lemma 5.16, and the fixed points of T_π and T_{π′} are V^π and V^{π′}, respectively, we have

  ‖V^π − V^{π′}‖_∞ ≤ 2Kδ/(1−γ).   (26)

Further, by construction π′ selects only ε-optimal actions and thus, by Lemma 5.15, ‖V^{π′} − V*‖_∞ ≤ ε/(1−γ). Combining this with (26), we get that ‖V^π − V*‖_∞ ≤ (ε + 2Kδ)/(1−γ), finishing the proof.

We are ready to prove the main result of the paper, stated earlier as Theorem 4.2:

Theorem 5.18. Let K = K_r/(1−γ) and let ε > 0. Fix some V_0 ∈ B_K(X). Let ε_1 = ε(1−γ)/(2(1+γ)) and δ_1 = ε(1−γ)/(4K) (= ε(1−γ)²/(4K_r)). Further, let

  t = t(ε_1, γ, K) = ⌈ log( 32K/(ε(1−γ)²) ) / log(1/γ) ⌉   (27)

and let

  p_5(d, ε, K, K_p, L_p, |A|, γ) = 512 K² K_p² ( 96(K+1)/(ε(1−γ)³) )² ( log 8 + log(t+1) + log|A| + d log( 768(K+1)² L_p d/(ε(1−γ)³) + 1 ) + log( 4K/(ε(1−γ)) ) ).   (28)

Choose

  N ≥ p_5(d, ε, K, K_p, L_p, |A|, γ)   (29)

and let V = T̂^t V_0. Further, let the stochastic stationary policy π: X × A → [0,1] be defined by

  π(x, a) = P( π_{X_{1:N}}(x) = a ),   (30)

where π_{X_{1:N}} is the policy defined by T̂_{π_{X_{1:N}}} V = T̂ V. Then π is ε-optimal and, given a state x, a random action of π can be computed in time and space polynomial in 1/ε, d, K, log L_p, |A| and 1/(1−γ).

Proof. The second part of the statement is immediate (cf. Remark 3.2). The bound on the computation time is

  O( t (N+1)² |A| )   (31)

and the space requirement of the algorithm is^10

  O( N + |A| ).   (32)

For the first part, fix X_{1:N}. By Theorem 5.13, the choice of N ensures that V^{π_{X_{1:N}}} satisfies

  P( ‖V^{π_{X_{1:N}}} − V*‖_∞ ≥ ε_1 ) ≤ δ_1.

We claim that if ω is such that ‖V^{π_{X_{1:N}}}(ω) − V*‖_∞ ≤ ε_1, then π_{X_{1:N}}(ω)(x) ∈ A_{ε_1(1+γ)}(x). Let us pick such an ω and let T_1 = T_{π_{X_{1:N}}(ω)}; note that V^{π_{X_{1:N}}}(ω) is the fixed point of T_1. Then

  ‖T_1 V* − V*‖_∞ ≤ ‖T_1 V* − T_1 V^{π_{X_{1:N}}}(ω)‖_∞ + ‖V^{π_{X_{1:N}}}(ω) − V*‖_∞ ≤ (1+γ) ε_1.

Therefore, using the definition of A_ε(x), we get that π_{X_{1:N}}(ω)(x) ∈ A_{ε_1(1+γ)}(x). This shows that

  P( π_{X_{1:N}}(x) ∈ A_{ε_1(1+γ)}(x) ) ≥ 1 − δ_1.

Now, by Lemma 5.17, the policy π defined by (30) is ( ε_1(1+γ) + 2Kδ_1 )/(1−γ)-optimal, i.e.,

  ‖V^π − V*‖_∞ ≤ ( ε_1(1+γ) + 2Kδ_1 )/(1−γ).

Substituting the definitions of ε_1 and δ_1 yields the result.

Footnote 10. Assuming that only the normalization factors of the transition probabilities p̂_{X_{1:N}} are stored.

6. Conclusions and Further Work

In this article we have considered an on-line planning algorithm that was shown to avoid the curse of dimensionality. Bounds following from Rust's original result by Markov's inequality were improved upon in several ways: our bounds depend poly-logarithmically on the Lipschitz constant of the transition probabilities, they do not depend on the Lipschitz constant of the immediate rewards (we dropped the assumption of Lipschitz-continuous immediate reward functions), and the number of samples depends on the cardinality of the action set in a poly-logarithmic way, as well.

It is interesting to note that although our bounds depend poly-logarithmically on the Lipschitz constant of the transition probabilities (characterizing how fast the dynamics is), they depend polynomially on the bound on the transition probabilities (characterizing the randomness of the MDP). Therefore, perhaps not surprisingly, for these kinds of Monte-Carlo algorithms faster dynamics are easier to cope with than less random dynamics (with peaky transition probability functions).

As a consequence of our results, many interesting questions arise. For example, different variants of the proposed algorithm could be compared, such as multigrid versions, versions using quasi-random numbers, or versions that use importance sampling. In practice, one would probably choose not to recompute the cache C_ε for each query. Also, in practice, one would probably precompute the transition probability table p̂_{X_{1:N}}(X_i | X_j, a) and, in order to speed up the iterations, one would probably skip the computations involving those transition probability values that are very close to zero. This would considerably speed up the computations, as one would expect distant parts of the state space to be uncoupled. However, the theoretical effect of these modifications needs to be explored.

Note that the Lipschitz condition on p can be replaced by an appropriate condition on the metric entropy of p(·|·, a) and the proofs will still go through. Therefore the proofs can be extended to Hölder classes of transition laws or local Lipschitz classes (e.g. |p(x|x_1, a) − p(x|x_2, a)| ≤ L(x, a) ‖x_1 − x_2‖_1; in this case one would need to use bracketing numbers), smooth functions, Sobolev classes, etc. One of the most interesting problems is to extend the results to infinite action spaces. For sure, such an extension needs some regularity assumptions on the dependence of the transition probability law and the reward function on the actions. It would also be interesting to prove analogous results for discrete MDPs having a factorized representation.
The presented algorithm may find applications in economic problems without any modifications [12]. We are also working on applications to deterministic continuous state-space, finite-action-space control problems and to partially observable MDPs over discrete spaces. Also, a combination with look-ahead search could be interesting from the practical point of view.

The algorithm considered in this article was tried in practice on some standard problems (car-on-the-hill, acrobot) and it was observed to yield reasonable performance even when the number of samples was kept quite small (in the range of a few hundred to a few thousand samples). It was also observed that boundary effects can interfere negatively with the algorithm. Details of these experiments, however, will be described elsewhere.

References

[1] R. Bellman. Dynamic Programming. Princeton University Press, Princeton, New Jersey, 1957.
[2] D. P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, Englewood Cliffs, NJ, USA, 1989.
[3] C.S. Chow and J.N. Tsitsiklis. The complexity of dynamic programming. Journal of Complexity, 5, 1989.
[4] C.S. Chow and J.N. Tsitsiklis. An optimal multigrid algorithm for continuous state discrete time stochastic control. IEEE Transactions on Automatic Control, 36(8):898-914, 1991.
[5] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Applications of Mathematics: Stochastic Modelling and Applied Probability. Springer-Verlag, New York, 1996.
[6] E.B. Dynkin and A.A. Yushkevich. Controlled Markov Processes. Springer-Verlag, Berlin, 1979.
[7] G.S. Fishman. Monte Carlo Concepts, Algorithms, and Applications. Springer-Verlag, 1999.
[8] M. Kearns, Y. Mansour, and A.Y. Ng. Approximate planning in large POMDPs via reusable trajectories. In S. A. Solla, T. K. Leen, and K. R. Müller, editors, Advances in Neural Information Processing Systems 12. MIT Press, Cambridge, MA, 1999. To appear.
[9] M. Kearns, Y. Mansour, and A.Y. Ng. A sparse sampling algorithm for near-optimal planning in large Markovian decision processes. In Proceedings of IJCAI 99, 1999.
[10] D. Pollard. Convergence of Stochastic Processes. Springer-Verlag, New York, 1984.
[11] M.L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, 1994.
[12] J. Rust. Structural estimation of Markov decision processes. In Handbook of Econometrics, volume 4, chapter 51. North-Holland, 1994.
[13] J. Rust. Using randomization to break the curse of dimensionality. Econometrica, 65:487-516, 1996.
[14] R. J. Williams and L.C. Baird, III. Tight performance bounds on greedy policies based on imperfect value functions. In Proceedings of the Tenth Yale Workshop on Adaptive and Learning Systems, 1994.


More information

Markov decision processes and interval Markov chains: exploiting the connection

Markov decision processes and interval Markov chains: exploiting the connection Markov decision processes and interval Markov chains: exploiting the connection Mingmei Teo Supervisors: Prof. Nigel Bean, Dr Joshua Ross University of Adelaide July 10, 2013 Intervals and interval arithmetic

More information

21 Markov Decision Processes

21 Markov Decision Processes 2 Markov Decision Processes Chapter 6 introduced Markov chains and their analysis. Most of the chapter was devoted to discrete time Markov chains, i.e., Markov chains that are observed only at discrete

More information

The Optimal Stopping of Markov Chain and Recursive Solution of Poisson and Bellman Equations

The Optimal Stopping of Markov Chain and Recursive Solution of Poisson and Bellman Equations The Optimal Stopping of Markov Chain and Recursive Solution of Poisson and Bellman Equations Isaac Sonin Dept. of Mathematics, Univ. of North Carolina at Charlotte, Charlotte, NC, 2822, USA imsonin@email.uncc.edu

More information

Introduction to Reinforcement Learning

Introduction to Reinforcement Learning Introduction to Reinforcement Learning Rémi Munos SequeL project: Sequential Learning http://researchers.lille.inria.fr/ munos/ INRIA Lille - Nord Europe Machine Learning Summer School, September 2011,

More information

THE INVERSE FUNCTION THEOREM FOR LIPSCHITZ MAPS

THE INVERSE FUNCTION THEOREM FOR LIPSCHITZ MAPS THE INVERSE FUNCTION THEOREM FOR LIPSCHITZ MAPS RALPH HOWARD DEPARTMENT OF MATHEMATICS UNIVERSITY OF SOUTH CAROLINA COLUMBIA, S.C. 29208, USA HOWARD@MATH.SC.EDU Abstract. This is an edited version of a

More information

Dynkin (λ-) and π-systems; monotone classes of sets, and of functions with some examples of application (mainly of a probabilistic flavor)

Dynkin (λ-) and π-systems; monotone classes of sets, and of functions with some examples of application (mainly of a probabilistic flavor) Dynkin (λ-) and π-systems; monotone classes of sets, and of functions with some examples of application (mainly of a probabilistic flavor) Matija Vidmar February 7, 2018 1 Dynkin and π-systems Some basic

More information

Lecture 4: Approximate dynamic programming

Lecture 4: Approximate dynamic programming IEOR 800: Reinforcement learning By Shipra Agrawal Lecture 4: Approximate dynamic programming Deep Q Networks discussed in the last lecture are an instance of approximate dynamic programming. These are

More information

Computational complexity estimates for value and policy iteration algorithms for total-cost and average-cost Markov decision processes

Computational complexity estimates for value and policy iteration algorithms for total-cost and average-cost Markov decision processes Computational complexity estimates for value and policy iteration algorithms for total-cost and average-cost Markov decision processes Jefferson Huang Dept. Applied Mathematics and Statistics Stony Brook

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

The geometry of Gaussian processes and Bayesian optimization. Contal CMLA, ENS Cachan

The geometry of Gaussian processes and Bayesian optimization. Contal CMLA, ENS Cachan The geometry of Gaussian processes and Bayesian optimization. Contal CMLA, ENS Cachan Background: Global Optimization and Gaussian Processes The Geometry of Gaussian Processes and the Chaining Trick Algorithm

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Model-Based Reinforcement Learning Model-based, PAC-MDP, sample complexity, exploration/exploitation, RMAX, E3, Bayes-optimal, Bayesian RL, model learning Vien Ngo MLR, University

More information

Richard S. Palais Department of Mathematics Brandeis University Waltham, MA The Magic of Iteration

Richard S. Palais Department of Mathematics Brandeis University Waltham, MA The Magic of Iteration Richard S. Palais Department of Mathematics Brandeis University Waltham, MA 02254-9110 The Magic of Iteration Section 1 The subject of these notes is one of my favorites in all mathematics, and it s not

More information

Basics of reinforcement learning

Basics of reinforcement learning Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system

More information

I. The space C(K) Let K be a compact metric space, with metric d K. Let B(K) be the space of real valued bounded functions on K with the sup-norm

I. The space C(K) Let K be a compact metric space, with metric d K. Let B(K) be the space of real valued bounded functions on K with the sup-norm I. The space C(K) Let K be a compact metric space, with metric d K. Let B(K) be the space of real valued bounded functions on K with the sup-norm Proposition : B(K) is complete. f = sup f(x) x K Proof.

More information

Internet Monetization

Internet Monetization Internet Monetization March May, 2013 Discrete time Finite A decision process (MDP) is reward process with decisions. It models an environment in which all states are and time is divided into stages. Definition

More information

On the Principle of Optimality for Nonstationary Deterministic Dynamic Programming

On the Principle of Optimality for Nonstationary Deterministic Dynamic Programming On the Principle of Optimality for Nonstationary Deterministic Dynamic Programming Takashi Kamihigashi January 15, 2007 Abstract This note studies a general nonstationary infinite-horizon optimization

More information

Learning in Zero-Sum Team Markov Games using Factored Value Functions

Learning in Zero-Sum Team Markov Games using Factored Value Functions Learning in Zero-Sum Team Markov Games using Factored Value Functions Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 27708 mgl@cs.duke.edu Ronald Parr Department of Computer

More information

Regularity for Poisson Equation

Regularity for Poisson Equation Regularity for Poisson Equation OcMountain Daylight Time. 4, 20 Intuitively, the solution u to the Poisson equation u= f () should have better regularity than the right hand side f. In particular one expects

More information

CS599 Lecture 1 Introduction To RL

CS599 Lecture 1 Introduction To RL CS599 Lecture 1 Introduction To RL Reinforcement Learning Introduction Learning from rewards Policies Value Functions Rewards Models of the Environment Exploitation vs. Exploration Dynamic Programming

More information

RL 14: Simplifications of POMDPs

RL 14: Simplifications of POMDPs RL 14: Simplifications of POMDPs Michael Herrmann University of Edinburgh, School of Informatics 04/03/2016 POMDPs: Points to remember Belief states are probability distributions over states Even if computationally

More information

ON THE POLICY IMPROVEMENT ALGORITHM IN CONTINUOUS TIME

ON THE POLICY IMPROVEMENT ALGORITHM IN CONTINUOUS TIME ON THE POLICY IMPROVEMENT ALGORITHM IN CONTINUOUS TIME SAUL D. JACKA AND ALEKSANDAR MIJATOVIĆ Abstract. We develop a general approach to the Policy Improvement Algorithm (PIA) for stochastic control problems

More information

Polynomial time Prediction Strategy with almost Optimal Mistake Probability

Polynomial time Prediction Strategy with almost Optimal Mistake Probability Polynomial time Prediction Strategy with almost Optimal Mistake Probability Nader H. Bshouty Department of Computer Science Technion, 32000 Haifa, Israel bshouty@cs.technion.ac.il Abstract We give the

More information

Linearly-solvable Markov decision problems

Linearly-solvable Markov decision problems Advances in Neural Information Processing Systems 2 Linearly-solvable Markov decision problems Emanuel Todorov Department of Cognitive Science University of California San Diego todorov@cogsci.ucsd.edu

More information

Random Feature Maps for Dot Product Kernels Supplementary Material

Random Feature Maps for Dot Product Kernels Supplementary Material Random Feature Maps for Dot Product Kernels Supplementary Material Purushottam Kar and Harish Karnick Indian Institute of Technology Kanpur, INDIA {purushot,hk}@cse.iitk.ac.in Abstract This document contains

More information

7: FOURIER SERIES STEVEN HEILMAN

7: FOURIER SERIES STEVEN HEILMAN 7: FOURIER SERIES STEVE HEILMA Contents 1. Review 1 2. Introduction 1 3. Periodic Functions 2 4. Inner Products on Periodic Functions 3 5. Trigonometric Polynomials 5 6. Periodic Convolutions 7 7. Fourier

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

Planning in Markov Decision Processes

Planning in Markov Decision Processes Carnegie Mellon School of Computer Science Deep Reinforcement Learning and Control Planning in Markov Decision Processes Lecture 3, CMU 10703 Katerina Fragkiadaki Markov Decision Process (MDP) A Markov

More information

arxiv: v1 [math.oc] 9 Oct 2018

arxiv: v1 [math.oc] 9 Oct 2018 A Convex Optimization Approach to Dynamic Programming in Continuous State and Action Spaces Insoon Yang arxiv:1810.03847v1 [math.oc] 9 Oct 2018 Abstract A convex optimization-based method is proposed to

More information

Metric Spaces and Topology

Metric Spaces and Topology Chapter 2 Metric Spaces and Topology From an engineering perspective, the most important way to construct a topology on a set is to define the topology in terms of a metric on the set. This approach underlies

More information

Course 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016

Course 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016 Course 16:198:520: Introduction To Artificial Intelligence Lecture 13 Decision Making Abdeslam Boularias Wednesday, December 7, 2016 1 / 45 Overview We consider probabilistic temporal models where the

More information

Bandit models: a tutorial

Bandit models: a tutorial Gdt COS, December 3rd, 2015 Multi-Armed Bandit model: general setting K arms: for a {1,..., K}, (X a,t ) t N is a stochastic process. (unknown distributions) Bandit game: a each round t, an agent chooses

More information

Finding a Needle in a Haystack: Conditions for Reliable. Detection in the Presence of Clutter

Finding a Needle in a Haystack: Conditions for Reliable. Detection in the Presence of Clutter Finding a eedle in a Haystack: Conditions for Reliable Detection in the Presence of Clutter Bruno Jedynak and Damianos Karakos October 23, 2006 Abstract We study conditions for the detection of an -length

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Based on [Kaelbling et al., 1996, Bertsekas, 2000] Bert Kappen Reinforcement learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error

More information

Markov Decision Processes and Solving Finite Problems. February 8, 2017

Markov Decision Processes and Solving Finite Problems. February 8, 2017 Markov Decision Processes and Solving Finite Problems February 8, 2017 Overview of Upcoming Lectures Feb 8: Markov decision processes, value iteration, policy iteration Feb 13: Policy gradients Feb 15:

More information

An Analysis of Model-Based Interval Estimation for Markov Decision Processes

An Analysis of Model-Based Interval Estimation for Markov Decision Processes An Analysis of Model-Based Interval Estimation for Markov Decision Processes Alexander L. Strehl, Michael L. Littman astrehl@gmail.com, mlittman@cs.rutgers.edu Computer Science Dept. Rutgers University

More information

Optimal Control. McGill COMP 765 Oct 3 rd, 2017

Optimal Control. McGill COMP 765 Oct 3 rd, 2017 Optimal Control McGill COMP 765 Oct 3 rd, 2017 Classical Control Quiz Question 1: Can a PID controller be used to balance an inverted pendulum: A) That starts upright? B) That must be swung-up (perhaps

More information

Stochastic Safest and Shortest Path Problems

Stochastic Safest and Shortest Path Problems Stochastic Safest and Shortest Path Problems Florent Teichteil-Königsbuch AAAI-12, Toronto, Canada July 24-26, 2012 Path optimization under probabilistic uncertainties Problems coming to searching for

More information

On the static assignment to parallel servers

On the static assignment to parallel servers On the static assignment to parallel servers Ger Koole Vrije Universiteit Faculty of Mathematics and Computer Science De Boelelaan 1081a, 1081 HV Amsterdam The Netherlands Email: koole@cs.vu.nl, Url: www.cs.vu.nl/

More information

Sample width for multi-category classifiers

Sample width for multi-category classifiers R u t c o r Research R e p o r t Sample width for multi-category classifiers Martin Anthony a Joel Ratsaby b RRR 29-2012, November 2012 RUTCOR Rutgers Center for Operations Research Rutgers University

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

On Finding Optimal Policies for Markovian Decision Processes Using Simulation

On Finding Optimal Policies for Markovian Decision Processes Using Simulation On Finding Optimal Policies for Markovian Decision Processes Using Simulation Apostolos N. Burnetas Case Western Reserve University Michael N. Katehakis Rutgers University February 1995 Abstract A simulation

More information

Markov Decision Processes and Dynamic Programming

Markov Decision Processes and Dynamic Programming Master MVA: Reinforcement Learning Lecture: 2 Markov Decision Processes and Dnamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of

More information

6 Reinforcement Learning

6 Reinforcement Learning 6 Reinforcement Learning As discussed above, a basic form of supervised learning is function approximation, relating input vectors to output vectors, or, more generally, finding density functions p(y,

More information

Hilbert spaces. 1. Cauchy-Schwarz-Bunyakowsky inequality

Hilbert spaces. 1. Cauchy-Schwarz-Bunyakowsky inequality (October 29, 2016) Hilbert spaces Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/fun/notes 2016-17/03 hsp.pdf] Hilbert spaces are

More information