An Empirical Dynamic Programming Algorithm for Continuous MDPs


William B. Haskell, Rahul Jain, Hiteshi Sharma and Pengqian Yu

Abstract—We propose universal randomized function approximation-based empirical value iteration (EVI) algorithms for Markov decision processes. The empirical nature comes from each iteration being done empirically from samples available from simulations of the next state. This makes the Bellman operator a random operator. A parametric and a non-parametric method for function approximation, using a parametric function space and a Reproducing Kernel Hilbert Space (RKHS) respectively, are then combined with EVI. Both function spaces have the universal function approximation property. Basis functions are picked randomly. Convergence analysis is carried out in a random operator framework with techniques from the theory of stochastic dominance. Finite-time sample complexity bounds are derived for both universal approximate dynamic programming algorithms. Numerical experiments support the versatility and effectiveness of this approach.

Index Terms—Continuous state space MDPs; Dynamic programming; Reinforcement learning.

I. INTRODUCTION

There exist a wide variety of approximate dynamic programming (DP) [1, Chapter 6] and reinforcement learning (RL) [3] algorithms for finite state space Markov decision processes (MDPs). But many real-world problems of interest have a state space that is either continuous, or large enough that it is best approximated as one. Approximate DP and RL algorithms do exist for continuous state space MDPs, but choosing which one to employ is an art form: different techniques (state space aggregation and function approximation [4]) and algorithms work for different problems, and universally applicable algorithms are lacking. For example, fitted value iteration [5] is very effective for some problems but requires the choice of an appropriate set of basis functions for good approximation. No guidelines are available on how many basis functions to pick to ensure a given quality of approximation. The strategy then employed is to use a large enough basis, but this has demanding computational complexity.

In this paper, we propose approximate DP algorithms for continuous state space MDPs that are universal (the approximating function space can provide arbitrarily good approximation for any problem), computationally tractable, simple to implement, and yet come with non-asymptotic sample complexity bounds. The first goal is accomplished by picking function spaces for approximation that are dense in the space of continuous functions. The second is achieved by relying on randomized selection of basis functions for approximation and on empirical dynamic programming [6]. The third is enabled because standard Python routines can be used for function fitting, and the fourth by analysis in a random operator framework, which provides non-asymptotic rates of convergence and sample complexity bounds.

There is a large body of well-known literature on reinforcement learning and approximate dynamic programming for continuous state space MDPs. We discuss the most directly related work.

W. B. Haskell and P. Yu are with the Department of Industrial and Systems Engineering, National University of Singapore. Rahul Jain and Hiteshi Sharma are with the EE Department, University of Southern California. The second author's work was supported by an ONR Young Investigator Award. Manuscript submitted September 2017.
In [7], a sampling-based state space aggregation scheme combined with sample average approximation of the expectation in the Bellman operator is proposed. Under some regularity assumptions, the approximate value function can be computed at any state and an estimate of the expected error is given, but the algorithm seems to suffer from poor numerical performance. A linear programming-based constraint-sampling approach was introduced in [8]. Finite sample error guarantees with respect to the constraint-sampling distribution are provided, but the method suffers from issues of feasibility. The closest paper to ours is [5], which does function fitting with a given basis and performs empirical value iteration in each step. Unfortunately, it is not a universal method, as the approximation quality depends on the function basis picked. Other papers worth noting are [9], which discusses kernel-based value iteration and the bias-variance tradeoff, and [10], which proposed a kernel-based algorithm with random sampling of the state and action spaces and proves asymptotic convergence.

This paper is inspired by the random function approach that uses randomization to nearly solve otherwise intractable problems (see, e.g., [11], [12]) and the empirical approach that reduces the computational complexity of working with expectations [6]. We propose two algorithms. For the first, parametric, approach we pick a parametric function family; in each iteration a number of functions are picked randomly for function fitting by sampling the parameters. For the second, non-parametric, approach we pick an RKHS for approximation. Both function spaces are dense in the space of continuous functions. In each iteration, we sample a few states from the state space, and empirical value iteration (EVI) is then performed on these states. Each step of EVI approximates the Bellman operator with an empirical (random) Bellman operator by plugging in a sample average approximation, obtained by simulation, for the expectation. This is akin to stochastic approximation with step size 1.

An elegant random operator framework for convergence analysis using stochastic dominance was introduced recently in [6], and it comes in handy here as well.

The main contribution of this paper is the development of randomized function approximation-based offline dynamic programming algorithms that are universally applicable, i.e., that do not require an appropriate choice of basis functions for good approximation. A secondary contribution is further development of the random operator framework for convergence analysis in the L_p norm, which also yields finite-time sample complexity bounds.

The paper is organized as follows. Section II presents preliminaries, including the continuous state space MDP model and the empirical dynamic programming framework for finite state MDPs introduced in [6]. Section III presents two empirical value iteration algorithms: first, a randomized parametric function fitting method, and second, a non-parametric randomized function fitting method in an RKHS. We also state the main theorems giving non-asymptotic error guarantees. Section IV presents a unified analysis of the two algorithms in a random operator framework. Numerical results are reported in Section V. Secondary proofs are relegated to the appendix.

II. PRELIMINARIES

Consider a discrete time discounted MDP given by the 5-tuple (S, A, Q, c, γ). The state space S is a compact subset of R^d with the Euclidean norm, with corresponding Borel σ-algebra B(S). Let F be the space of all B(S)-measurable bounded functions f : S → R with the supremum norm ||f||_∞ := sup_{s∈S} |f(s)|. Moreover, let M(S) be the space of all probability distributions over S, and define the L_1 norm ||f||_{1,µ} := ∫_S |f(s)| µ(ds) for given µ ∈ M(S). We assume that the action space A is finite. The transition law Q governs the system evolution: for B ∈ B(S), Q(B | s, a) is the probability of next visiting the set B given that action a ∈ A is chosen in state s. The cost function c : S × A → R is a measurable function of state-action pairs. Finally, γ ∈ (0, 1) is the discount factor.

We denote by Π the class of stationary deterministic Markov policies, i.e., mappings π : S → A which depend on history only through the current state. For a given state s ∈ S, π(s) ∈ A is the action chosen in state s under the policy π. The state and action at time t are denoted s_t and a_t, respectively. Any policy π ∈ Π and initial state s ∈ S determine a probability measure P_s^π and a stochastic process {(s_t, a_t), t ≥ 0} defined on the canonical measurable space of trajectories of state-action pairs. The expectation operator with respect to P_s^π is denoted E_s^π.

We assume that the cost function c satisfies |c(s, a)| ≤ c_max < ∞ for all (s, a) ∈ S × A. Under this assumption, ||v^π||_∞ ≤ v_max := c_max / (1 − γ), where v^π is the value function of policy π ∈ Π, defined as v^π(s) = E_s^π[ Σ_{t≥0} γ^t c(s_t, a_t) ], s ∈ S. For later use, we define F(S; v_max) to be the space of all functions f ∈ F such that ||f||_∞ ≤ v_max. The optimal value function is v*(s) := inf_{π∈Π} E_s^π[ Σ_{t≥0} γ^t c(s_t, a_t) ], s ∈ S. To characterize the optimal value function, we define the Bellman operator T : F → F via

T v(s) := min_{a∈A} { c(s, a) + γ E_{X∼Q(·|s,a)}[ v(X) ] }, s ∈ S.

It is known that v* is a fixed point of T, i.e., T v* = v*. Classical value iteration is based on iterating T to obtain this fixed point: it produces a sequence {v_k}_{k≥0} ⊂ F given by v_{k+1} = T v_k, k ≥ 0, and {v_k}_{k≥0} converges to v* geometrically in ||·||_∞.
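For intuition, when S is replaced by a finite grid and Q by one transition matrix per action, iterating T takes only a few lines of code. The sketch below (plain NumPy; names are ours) is the exact-expectation counterpart of the empirical algorithms developed in the rest of the paper.

```python
import numpy as np

def bellman_operator(v, P, c, gamma):
    """Exact Bellman operator on a discretized model.

    P[a] is an (n x n) transition matrix for action a, c[a] an n-vector of
    costs, v an n-vector of values on the grid; returns T v on the grid.
    """
    # For each action: c(s, a) + gamma * E[v(next state) | s, a]
    q_values = np.stack([c[a] + gamma * P[a] @ v for a in range(len(P))])
    return q_values.min(axis=0)

def value_iteration(P, c, gamma, tol=1e-8, max_iter=10_000):
    v = np.zeros(P[0].shape[0])
    for _ in range(max_iter):
        v_new = bellman_operator(v, P, c, gamma)
        # stop once the sup-norm update certifies tol-accuracy of v_new
        if np.max(np.abs(v_new - v)) < tol * (1 - gamma) / (2 * gamma):
            return v_new
        v = v_new
    return v
```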
We are interested in approximating v* within a tractable class of approximating functions F ⊂ F(S). We have the following definitions, which measure the approximation power of F with respect to T. We define

d_{p,µ}(T f, F) := inf_{f′∈F} || f′ − T f ||_{p,µ}

to be the approximation error for a specific f, and

d_{p,µ}(T F, F) := sup_{f∈F} d_{p,µ}(T f, F)

to be the inherent L_p Bellman error of the function class F. Similarly, the inherent L_∞ Bellman error of an approximating class F is d_∞(T F, F) := sup_{f∈F} inf_{g∈F} || g − T f ||_∞. We often compare F to the Lipschitz continuous functions Lip(L), defined as { f ∈ F : |f(s) − f(s′)| ≤ L ||s − s′||, ∀ s, s′ ∈ S }. We say that an approximation class F is universal if d_∞(Lip(L), F) = 0 for all L ≥ 0.

One of the difficulties of dynamic programming algorithms like value iteration is that each application of the Bellman operator involves the computation of an expectation, which may be expensive. Thus, in [13] we proposed replacing the Bellman operator with an empirical (random) Bellman operator,

T̂_n v(s) := min_{a∈A} { c(s, a) + (γ/n) Σ_{i=1}^n v(X_i) },

where the X_i are samples of the next state from Q(·|s, a), obtained by simulation. We can then iterate the empirical Bellman operator, v_{k+1} = T̂_n v_k, an algorithm we called Empirical Value Iteration (EVI). The sequence of iterates {v_k} is a random process. Since T is a contraction, its iterates converge to its fixed point v*. The random operator T̂_n may be expected to inherit the contraction property in a probabilistic sense, with its iterates converging to some sort of probabilistic fixed point.
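A minimal sketch of one application of T̂_n over a set of sampled states, assuming a generic simulator `sample_next(s, a, n)` (our name) that returns n draws from Q(·|s, a):

```python
import numpy as np

def empirical_bellman(v, states, actions, cost, sample_next, gamma, n):
    """Empirical Bellman operator: estimate T v at each sampled state.

    v: callable giving the current value estimate at a state.
    sample_next(s, a, n): returns n simulated next states from Q(.|s, a).
    """
    tv = np.empty(len(states))
    for i, s in enumerate(states):
        q = [cost(s, a) + gamma * np.mean([v(x) for x in sample_next(s, a, n)])
             for a in actions]
        tv[i] = min(q)          # sample-average estimate of T v at s
    return tv
```

Each call uses fresh simulation samples, so T̂_n is a random operator; the fitted-function algorithms of Section III feed these noisy targets into a regression step.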

We introduce (ε, δ) versions of two such notions from [6].

Definition 1. A function v̂ : S → R is an (ε, δ)-strong probabilistic fixed point of a sequence of random operators {T̂_n} if there exists an N such that for all n > N, P( ||T̂_n v̂ − v̂|| > ε ) < δ. It is called a strong probabilistic fixed point if the above holds for every positive ε and δ.

Definition 2. A function v̂ : S → R is an (ε, δ)-weak probabilistic fixed point of a sequence of random operators {T̂_n} if there exist N and K such that for all n > N and all k > K, P( ||T̂_n^k v_0 − v̂|| > ε ) < δ for all v_0 ∈ F. It is called a weak probabilistic fixed point if the above holds for every positive ε and δ.

Note that stochastic iterative algorithms such as EVI typically find a weak probabilistic fixed point of T̂_n, whereas what we are looking for is v*, the fixed point of T. In [13] it was shown that, asymptotically, the weak probabilistic fixed points of T̂_n coincide with its strong probabilistic fixed points, which in turn coincide with the fixed point of T, under fairly weak assumptions and a natural relationship between T and T̂_n:

lim_{n→∞} P( ||T̂_n v − T v|| > ε ) = 0, ∀ v ∈ F.

This implies that stochastic iterative algorithms such as EVI will find approximate fixed points of T with high probability.

III. THE ALGORITHMS AND MAIN RESULTS

When the state space is very large, or even uncountable, exact dynamic programming methods are not practical, or even feasible. Instead, one must use approximation methods. In particular, function approximation, i.e., fitting the value function within a fixed function basis, is a common technique. The idea is to sample a finite set of states from S, approximate the Bellman update at these states, and then extend to the rest of S through function fitting, similar to [5]. Furthermore, the expectation in the Bellman operator is also approximated, by taking a number of samples of the next state. There are two main difficulties with this approach. First, the quality of the fit depends on the function basis chosen, making the results problem-dependent. Second, with a large basis for good approximation, function fitting can be computationally expensive. In this paper, we address these issues by first picking universal approximating function spaces, and then using randomization to pick a smaller basis and thus reduce the computational burden of the function fitting step. We consider two functional families: a parametric family F(Θ), parameterized over a parameter space Θ, and a non-parametric regularized RKHS.

By µ ∈ M(S) we denote a probability distribution from which to sample states in S, and by F ⊆ F(S; v_max) a functional family in which to do value function approximation. Let {v_k}_{k≥0} ⊂ F(S; v_max) denote the iterates of the value functions produced by an algorithm, and let s^{1:N} = (s_1, ..., s_N) denote a sample of size N from S. The empirical p-norm of f is defined as ||f||^p_{p,µ̂} := (1/N) Σ_{n=1}^N |f(s_n)|^p for 1 ≤ p < ∞, and ||f||_{∞,µ̂} := max_{n=1,...,N} |f(s_n)| for p = ∞, where µ̂ is the empirical measure corresponding to the samples s^{1:N}.

We make the following technical assumptions for the rest of the paper, similar to those made in [5].

Assumption 1. (i) For all (s, a) ∈ S × A, Q(·|s, a) is absolutely continuous with respect to µ and C_µ := sup_{(s,a)∈S×A} || dQ(·|s, a)/dµ ||_∞ < ∞. (ii) Given any sequence of policies {π_m}_{m≥1}, the future state distribution ρ Q^{π_1} ··· Q^{π_m} is absolutely continuous with respect to µ, c_{ρ,µ}(m) := sup_{π_1,...,π_m} || d(ρ Q^{π_1} ··· Q^{π_m})/dµ ||_∞ < ∞, and C_{ρ,µ} := Σ_{m≥0} γ^m c_{ρ,µ}(m) < ∞.

The above assumptions are conditions on the transition probabilities; the first is a sufficient condition for the second.

A. Random Parametric Basis Function (RPBF) Approximation

We introduce an empirical value iteration algorithm with function approximation using random parametrized basis functions (EVI+RPBF).
It requires a parametric family F(Θ) built from a set of parameters Θ with probability distribution ν, and a feature function φ : S × Θ → R that depends on both states and parameters, with the assumption that sup_{s∈S, θ∈Θ} |φ(s; θ)| ≤ 1. This can easily be met in practice by scaling φ whenever S and Θ are both compact and φ is continuous in (s, θ). Let α : Θ → R be a weight function and define

F(Θ) := { f(·) = ∫_Θ φ(·; θ) α(θ) dθ : |α(θ)| ≤ C ν(θ), ∀ θ ∈ Θ }.

We note that the condition |α(θ)| ≤ C ν(θ) for all θ ∈ Θ is equivalent to requiring that ||α||_{∞,ν} := sup_{θ∈Θ} |α(θ)/ν(θ)| ≤ C, where ||·||_{∞,ν} is the ν-weighted supremum norm and C is a constant. The function space F(Θ) may be chosen to have the universal function approximation property, in the sense that any Lipschitz continuous function can be approximated arbitrarily closely in this space, as shown in [11]. Many such choices of F(Θ) are possible and are developed in [11, Section 5]. For example, F(Θ) is universal in the following two cases: (i) φ(s; θ) = cos(⟨ω, s⟩ + b), where θ = (ω, b) ∈ R^{d+1}, with ν(θ) given by ω ∼ Normal(0, γI) and b ∼ Uniform[−π, π]; (ii) φ(s; θ) = sign(s_k − t), where θ = (t, k) ∈ R × {1, ..., d}, with ν(θ) given by k ∼ Uniform{1, ..., d} and t ∼ Uniform[−a, a].

In this approach we have a parametric function family F(Θ), but instead of optimizing over the parameters in Θ we randomly sample them first and then do function fitting, which involves optimizing over finite weighted combinations Σ_{j=1}^J α_j φ(·; θ_j). Optimizing over θ^{1:J} = (θ_1, ..., θ_J) and α^{1:J} = (α_1, ..., α_J) jointly is a non-convex problem. Hence, as in [12], we first randomize over θ^{1:J} and then optimize over α^{1:J}, bypassing the non-convexity inherent in optimizing over θ^{1:J} and α^{1:J} simultaneously. This approach allows us to deploy rich parametric families with little additional computational cost.
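As an illustration of this randomize-then-optimize idea with cosine features, here is a minimal sketch (NumPy/SciPy; function names are ours). The fit solves the bound-constrained least-squares problem with |α_j| ≤ C/J that appears as Step 3 of Algorithm 1 below.

```python
import numpy as np
from scipy.optimize import lsq_linear

def draw_features(J, d, rng, scale=1.0):
    """Sample J random cosine features theta_j = (omega_j, b_j)."""
    omega = rng.normal(0.0, scale, size=(J, d))
    b = rng.uniform(-np.pi, np.pi, size=J)
    return omega, b

def feature_matrix(states, omega, b):
    """Phi[n, j] = cos(<omega_j, s_n> + b_j) for an (N, d) array of states."""
    return np.cos(states @ omega.T + b)

def fit_weights(states, targets, omega, b, C):
    """Least-squares fit of alpha subject to |alpha_j| <= C/J (a convex problem)."""
    Phi = feature_matrix(states, omega, b)
    J = omega.shape[0]
    res = lsq_linear(Phi, targets, bounds=(-C / J, C / J))
    return res.x  # fitted value function: v(s) = Phi(s) @ alpha
```

The randomness enters only through `draw_features`; given the sampled features, the weight fit is a standard convex program.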

Once we draw a random sample {θ_j}_{j=1}^J from Θ according to ν, we obtain a random function space

F(θ^{1:J}) := { f = Σ_{j=1}^J α_j φ(·; θ_j) : |α_j| ≤ C/J, j = 1, ..., J }.

Step 1 of such an algorithm (Algorithm 1) involves sampling states s^{1:N} over which to do value iteration and sampling parameters θ^{1:J} to pick the basis functions φ(·; θ_j) used for function fitting. Step 2 performs an empirical value iteration over the states s^{1:N} by sampling next states {X_m^{s_n,a}}_{m=1}^M according to the transition kernel Q and using the current iterate v_k of the value function. Note that fresh i.i.d. samples of the next state are regenerated in each iteration. Step 3 finds the best fit to ṽ_k, the iterate from Step 2, within F(θ^{1:J}): the randomly sampled parameters θ^{1:J} specify the basis functions, and the weights α^{1:J} are optimized, which is a convex optimization problem.

Algorithm 1: EVI with random parameterized basis functions (EVI+RPBF)
Input: probability distributions µ on S and ν on Θ; sample sizes N, M, J; initial seed v_0.
For k = 1, ..., K:
1) Sample {s_n}_{n=1}^N ∼ µ^N and {θ_j}_{j=1}^J ∼ ν^J.
2) Compute ṽ_k(s_n) = min_{a∈A} { c(s_n, a) + (γ/M) Σ_{m=1}^M v_k(X_m^{s_n,a}) }, where X_m^{s_n,a} ∼ Q(·|s_n, a), m = 1, ..., M, are i.i.d.
3) Compute α^k = argmin_α Σ_{n=1}^N ( Σ_{j=1}^J α_j φ(s_n; θ_j) − ṽ_k(s_n) )², subject to |α_1|, ..., |α_J| ≤ C/J, and set v_{k+1}(s) = Σ_{j=1}^J α_j^k φ(s; θ_j).
4) Increment k ← k + 1 and return to Step 1.

We note that Step 3 of the algorithm can be replaced by another method of function fitting, as we do in the next subsection. The above algorithm differs from the Fitted Value Iteration (FVI) algorithm of [5] in how it does function fitting: FVI fits with a deterministic, given set of basis functions, which limits its universality, while we fit in a much larger space that has the universal function approximation property and reduce the computational complexity by exploiting randomization.

In [5, Section 7] it is shown that if the transition kernel and cost are smooth, in the sense that

||Q(·|s, a) − Q(·|s′, a)||_TV ≤ L_Q ||s − s′||   (1)
|c(s, a) − c(s′, a)| ≤ L_c ||s − s′||   (2)

hold for all s, s′ ∈ S and a ∈ A, then the Bellman operator T maps bounded functions to Lipschitz continuous functions. In particular, if v is uniformly bounded by v_max, then T v is (L_c + γ v_max L_Q)-Lipschitz continuous. Consequently, the inherent L_∞ Bellman error satisfies d_∞(T F, F) ≤ d_∞(Lip(L), F), since T F ⊆ Lip(L). So it only remains to choose an F(Θ) that is dense in Lip(L) in the supremum norm, for which many examples exist.

We now provide non-asymptotic sample complexity bounds establishing that Algorithm 1 yields an approximately optimal value function with high probability. We provide guarantees for both the L_1 and the L_2 metric on the error. Let N_1(ε, δ), M_1(ε, δ) and J_1(ε, δ) denote explicit sample-size thresholds for the number of sampled states, the number of next-state samples per state-action pair, and the number of random parameters, respectively: N_1 is of order (v_max/ε)², up to a factor involving J and log(1/δ) coming from the covering-number bound of Lemma 15 in the appendix; M_1 is of order (v_max/ε)² log(N|A|/δ); and J_1 is of order (C/ε)² log(1/δ), where C is the constant appearing in the definition of F(Θ). Let K_1(ε) := ⌈ ln(4 C_{ρ,µ} v_max / ε) / ln(1/γ) ⌉, set δ_1 := 1 − (1 − δ/2)^{1/K_1}, and define µ(p; K) := (1 − p) p^{K−1}.

Theorem 1. Given ε > 0 and δ ∈ (0, 1), choose J ≥ J_1(ε, δ_1), N ≥ N_1(ε, δ_1) and M ≥ M_1(ε, δ_1). Then, for K′ ≥ log( 4 / (δ µ(1 − δ_1; K_1)) ), we have

||v_{K′} − v*||_{1,ρ} ≤ 2 C_{ρ,µ} [ d_{1,µ}(T F(Θ), F(Θ)) + ε ]

with probability at least 1 − δ.

Remarks. That is, if we choose enough samples N of the states, enough samples M of the next state, and enough random samples J of the parameter θ, then for a large enough number of iterations the L_1 error in the value function is determined by the inherent Bellman error of the function class F(Θ). For the function families F(Θ) discussed earlier, the inherent Bellman error is indeed zero, and so the value function will have small L_1 error with high probability.
Next we give a similar guarantee on the L_2 error of Algorithm 1, by considering approximation in L_2(S, µ). We note that L_2(S, µ) is a Hilbert space, and many powerful function approximation results exist in this setting because of the favorable properties of a Hilbert space. The thresholds N_2(ε, δ), M_2(ε, δ) and J_2(ε, δ), and the iteration count K_2(ε) := ⌈ ln(4 C_{ρ,µ}^{1/2} v_max / ε) / ln(1/γ) ⌉, are defined analogously to those above, with the L_2 covering bounds in place of the L_1 bounds. Again set δ_1 := 1 − (1 − δ/2)^{1/K_2}.

5 5 Theorem. Given an > 0, and a 0,, choose J J,, N N,, M M,, Then, for K log 4/ µ ; K, we have v K v, ρ γ / C / ρ, µ d, µ T F Θ, F Θ + with probability at least. Remarks. Again, note that the above result implies that if we choose a function families FΘ with inherent Bellman error zero, enough samples N of the states, enough samples M of the next state, and enough random samples J of the parameter θ, and then for large enough number of iterations K, the value function will have small L error with high probability. B. Non-parametric Function Approximation in RKH We now consider non-parametric function approximation combined with EVI. We employ a Reproducing Kernel Hilbert pace RKH for function approximation since for suitably chosen kernels, it is dense in the space of continuous functions and hence has a universal function approximation property. In the RKH setting, we can obtain guarantees directly with respect to the supremum norm. We will consider a regularized RKH setting with a continuous, symmetric and positive semidefinite kernel K : R and a regularization constant λ > 0. The RKH space, H K is defined to be the closure of the linear span of Ks, s endowed with an inner product, HK. The inner product, HK for H K is defined such that K x,, K y, HK = K x, y for all x, y, i.e., i α ik x i,, j β jk y j, HK = i, j α iβ j K x i, y j. ubsequently, the inner product satisfies the reproducing property: K s,, f HK = f s for all s and f H K. The corresponding RKH norm is defined in terms of the inner product f HK := f, f HK. We assume that our kernel K is bounded so that κ := sup s K s, s <. To find the best fit f H K to a function with data s n, ṽ s n N n=, we solve the regularized least squares problem: min f H K N N f s n ṽ s n + λ f H K. 3 n= This is a convex optimization problem the norm squared is convex, and has a closed form solution by the Representer Theorem. In particular, the optimal solution is of the form ˆf s = N n= α nk s n, s where the weights α :N = α,..., α N are the solution to the linear system K s i, s j Ni, j= + λ N I α n N n= = ṽ s n N n=. 4 This yields EVI algorithm with randomized function fitting in a regularized RKH EVI+RKH displayed as Algorithm. Note that the optimization problem in tep 3 in Algorithm is analogous to the optimization problem in tep 3 of Algorithm which finds an approximate best fit within the finite-dimensional space F θ :J, rather than the entire space F Θ, while Problem 3 in Algorithm optimizes over the entire space H K. This difference can be reconciled by the Representer Theorem, since it states that optimization over H K in Problem 3 is equivalent to optimization over the finitedimensional space spanned by K s n, : n =,..., N. Note that the regularization λ f H K is a requirement of the Representer Theorem. Algorithm EVI with regularized RKH EVI+RKH Input: probability distribution µ on ; sample sizes N, M ; penalty λ; initial seed v 0 ; counter k = 0. For k =,..., K ample s n N n= µ. Compute ṽ k s n = min arg min f H K N c s n, a + γ M M m= v k X sn, a m where X sn, a M j Q s m= n, a are i.i.d. 3 v k+ is given by N fs n ṽs n + λ f HK. n= 4 Increment k k + and return to tep. We define the regression function f M : R via f M s E min c s, a + γ M v Xm s, a, s, M m= it is the expected value of our empirical estimator of T v. As expected, f M T v as M. We note that f P is not necessarily equal to T v by Jensen s inequality. We require the following assumption on f M to continue. Assumption. For every M, f M s = K s, y α y µ dy for some α L, µ. 
We define the regression function f_M : S → R via

f_M(s) := E[ min_{a∈A} { c(s, a) + (γ/M) Σ_{m=1}^M v(X_m^{s,a}) } ], s ∈ S;

it is the expected value of our empirical estimator of T v. As expected, f_M → T v as M → ∞. We note that f_M is not necessarily equal to T v, by Jensen's inequality. We require the following assumption on f_M to continue.

Assumption 2. For every M ≥ 1, f_M(s) = ∫_S K(s, y) α(y) µ(dy) for some α ∈ L_2(S, µ).

Regression functions play a key role in statistical learning theory; Assumption 2 states that the regression function lies in the span of the kernel K. Additionally, when H_K is dense in the Lipschitz functions, the inherent Bellman error is zero. For example, K(s, s′) = exp(−γ||s − s′||²), K(s, s′) = a^{−||s − s′||²}, and K(s, s′) = exp(−γ||s − s′||) are all universal kernels.

The thresholds N_3(ε, δ) and M_3(ε, δ) are explicit: they depend on the kernel through κ and a constant C_K that is independent of the dimension of S (see [14] for the details of how C_K depends on K), and grow polynomially in v_max/(ε(1−γ)) and logarithmically in |A| and 1/δ. Let K_3(ε) := ⌈ ln(4 v_max / ε) / ln(1/γ) ⌉ and set δ_1 := 1 − (1 − δ/2)^{1/K_3}.

Theorem 3. Suppose Assumption 2 holds. Given any ε > 0 and δ ∈ (0, 1), choose N ≥ N_3(ε, δ_1) and M ≥ M_3(ε, δ_1). Then, for any K′ ≥ log( 4 / (δ µ(1 − δ_1; K_3)) ), we have ||v_{K′} − v*||_∞ ≤ ε with probability at least 1 − δ.

Note that we provide guarantees on the L_1 and L_2 error (which can be generalized to L_p) with the RPBF method, and on the L_∞ error with the RKHS-based randomized function fitting method. Obtaining guarantees for the L_p error with the RKHS method has proved quite difficult, as have bounds on the L_∞ error with the RPBF method.

IV. ANALYSIS IN A RANDOM OPERATOR FRAMEWORK

We analyze Algorithms 1 and 2 in terms of random operators, since this framework is general enough to encompass many such algorithms. Step 2 of both algorithms is an iteration of the empirical Bellman operator, while Step 3 is a randomized function fitting step, done differently and in different spaces in the two algorithms. We use random operator notation to write the algorithms compactly, and then derive a clean and to a large extent unified convergence analysis. The key idea is to use the notion of stochastic dominance to bound the error process by an easy-to-analyze dominating Markov chain. We can then infer the solution quality of our algorithms from the probability distribution of the dominating Markov chain. This analysis refines, and in fact simplifies, the idea we introduced in [13] for MDPs with finite state and action spaces, where there is no function fitting and the analysis is in the supremum norm. In this paper we develop the technique further, give a stronger convergence rate, account for randomized function approximation, and generalize the technique to L_p norms.

We introduce a probability space (Ω, B(Ω), P) on which to define random operators, where Ω is a sample space with elements denoted ω ∈ Ω, B(Ω) is the Borel σ-algebra on Ω, and P is a probability distribution on (Ω, B(Ω)). A random operator is an operator-valued random variable on (Ω, B(Ω), P). We define the first random operator on F as

T̂ v = { (s_n, ṽ(s_n)) }_{n=1}^N,

where {s_n}_{n=1}^N is chosen from S according to a distribution µ ∈ M(S) and

ṽ(s_n) = min_{a∈A} { c(s_n, a) + (γ/M) Σ_{m=1}^M v(X_m^{s_n,a}) }, n = 1, ..., N,

is an approximation of T v(s_n) for all n = 1, ..., N. In other words, T̂ maps a function v ∈ F(S; v_max) to a randomly generated sample of N input-output pairs { (s_n, ṽ(s_n)) }_{n=1}^N of the function T v. Note that T̂ depends on the sample sizes N and M. Next, we have the function reconstruction operator Π_F, which maps the data { (s_n, ṽ(s_n)) }_{n=1}^N to an element of F. Note that Π_F is not necessarily deterministic, since Algorithms 1 and 2 use randomized function fitting. We can now write both algorithms succinctly as

v_{k+1} = Ĝ v_k := Π_F ( T̂ v_k ),   (5)

which can be further written in terms of the residual error ε_k := Ĝ v_k − T v_k as

v_{k+1} = Ĝ v_k = T v_k + ε_k.   (6)

Iteration of these operators corresponds to repeated samples from (Ω, B(Ω), P), so we define the space of sequences (Ω^∞, B(Ω^∞), P^∞), where Ω^∞ = Ω × Ω × ···, with elements denoted ω̄ = (ω_k)_{k≥0}, B(Ω^∞) = B(Ω) × B(Ω) × ···, and P^∞ is the probability measure on (Ω^∞, B(Ω^∞)) guaranteed by the Kolmogorov extension theorem applied to P. The random sequence {v_k}_{k≥0} produced by Algorithms 1 and 2, given by

v_{k+1} = Π_F ( T̂_{ω_k} v_k ) = Π_F T̂_{ω_k} ( Π_F T̂_{ω_{k−1}} ( ··· Π_F T̂_{ω_0} v_0 ) ), k ≥ 0,

is a stochastic process defined on (Ω^∞, B(Ω^∞), P^∞).

We now analyze how the error propagates over iterations, i.e., how the Bellman residual at each iteration of EVI affects the quality of the resulting policy. There are existing results which address error propagation in both the L_∞ and the L_p (p ≥ 1) norms [5]. After adapting [5, Lemma 3], we obtain the following point-wise error bounds on v_K − v* in terms of the errors {ε_k}_{k≥0}.
Lemma 4. For any K ≥ 1 and ε > 0, suppose ||ε_k||_{p,µ} ≤ ε for all k = 0, 1, ..., K−1. Then

||v_K − v*||_{p,ρ} ≤ [ 2(1 − γ^{K+1}) / (1 − γ) ] [ C_{ρ,µ}^{1/p} ε + γ^{K/p} (2 v_max) ],   (7)

where C_{ρ,µ} is the discounted-average concentrability coefficient of the future-state distributions, as defined in [5, Assumption A1]. Note that Lemma 4 assumes ||ε_k||_{p,µ} ≤ ε, which we show in the subsequent section holds with high probability.

The second inequality is for the supremum norm.

Lemma 5. For any K ≥ 1 and ε > 0, suppose ||ε_k||_∞ ≤ ε for all k = 0, 1, ..., K−1. Then

||v_K − v*||_∞ ≤ 2ε / (1 − γ) + 2 γ^K v_max.   (8)

Inequalities (7) and (8) are the key to analyzing the iteration in (6).
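For instance, (8) tells us how many iterations suffice for a target sup-norm accuracy once the per-iteration fitting error is controlled; a quick computation with illustrative values (the choice of K mirrors K_3 in Theorem 3):

```python
import math

# Illustrative values: discount factor, value bound, per-iteration fitting error.
gamma, v_max, eps = 0.9, 10.0, 0.05

# From (8): ||v_K - v*|| <= 2*eps/(1-gamma) + 2*gamma^K*v_max.
# Pick K so the second (transient) term is also at most eps/2:
K = math.ceil(math.log(4 * v_max / eps) / math.log(1 / gamma))
bound = 2 * eps / (1 - gamma) + 2 * gamma**K * v_max
print(K, bound)   # K = 64, bound ~ 1.02: dominated by the 2*eps/(1-gamma) term
```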

A. Convergence analysis using stochastic dominance

We now provide a unified convergence analysis for the iteration of a sequence of random operators given by (5) and (6); later we show how it applies to Algorithms 1 and 2. We use ||·|| to denote a general norm in the following discussion, since the argument applies to all instances of ||·||_{p,µ} and ||·||_∞ simultaneously. The magnitude of the error in iteration k ≥ 0 is then ||ε_k||. We make the following key assumption for a general EVI algorithm.

Assumption 3. For ε > 0, there is a q ∈ (0, 1) such that Pr( ||ε_k|| ≤ ε ) ≥ q for all k ≥ 0.

Assumption 3 states that we can find a lower bound q on the probability of the event {||ε_k|| ≤ ε} that is independent of k and of {v_k}_{k≥0}, but does depend on ε. Equivalently, we are lower-bounding the probability of the event { ||T v_k − Ĝ v_k|| ≤ ε }. This is possible for all of the algorithms proposed earlier; in particular, we can control q in Assumption 3 through the sample sizes used in each iteration of EVI. Naturally, for a given ε, q increases as the number of samples grows.

We first choose ε > 0 and the number of iterations K for our EVI algorithms to reach a desired accuracy (this choice of K comes from inequalities (7) and (8)). We call iteration k "good" if the error ||ε_k|| is within our desired tolerance ε, and "bad" otherwise. We then construct a stochastic process {X_k}_{k≥0} on (Ω^∞, B(Ω^∞), P^∞) with state space 𝒦 := {1, 2, ..., K} such that

X_{k+1} = max(X_k − 1, 1) if iteration k is "good", and X_{k+1} = K otherwise.

The stochastic process {X_k}_{k≥0} is easier to analyze than {v_k}_{k≥0} because it is defined on a finite state space; however, {X_k}_{k≥0} is not necessarily a Markov chain. We therefore construct a dominating Markov chain {Y_k}_{k≥0} to help analyze the behavior of {X_k}_{k≥0}. We construct {Y_k}_{k≥0} on (𝒦^∞, 𝓑), the canonical measurable space of trajectories on 𝒦, so Y_k : 𝒦^∞ → R, and we let Q denote the probability measure of {Y_k}_{k≥0} on (𝒦^∞, 𝓑). Since {Y_k}_{k≥0} is a Markov chain by construction, Q is completely determined by an initial distribution on 𝒦 and a transition kernel. We always initialize Y_0 = K, and construct the transition kernel as

Y_{k+1} = max(Y_k − 1, 1) with probability q, and Y_{k+1} = K with probability 1 − q,

where q is the probability of a "good" iteration with respect to the corresponding norm. Note that the {Y_k}_{k≥0} introduced here is different from, and has a much smaller state space than, the one introduced in [6], leading to stronger convergence guarantees.

We now describe a stochastic dominance relationship between the two processes {X_k}_{k≥0} and {Y_k}_{k≥0}; we will establish that {Y_k}_{k≥0} is larger than {X_k}_{k≥0} in a stochastic sense.

Definition 3. Let X and Y be two real-valued random variables. X is stochastically dominated by Y, written X ≤_st Y, when E[f(X)] ≤ E[f(Y)] for all increasing functions f : R → R. Equivalently, X ≤_st Y when Pr(X ≥ θ) ≤ Pr(Y ≥ θ) for all θ in the support of Y.

Let {F_k}_{k≥0} be the filtration on (Ω^∞, B(Ω^∞), P^∞) corresponding to the evolution of information about {X_k}_{k≥0}, and let (X_{k+1} | F_k) denote the conditional distribution of X_{k+1} given F_k. The following theorem, our main result for the random operator analysis, compares the marginal distributions of {X_k}_{k≥0} and {Y_k}_{k≥0} at all times k ≥ 0 when the two processes start from the same state.

Theorem 6. (i) X_k ≤_st Y_k for all k ≥ 0. (ii) Pr(Y_k ≤ η) ≤ Pr(X_k ≤ η) for any η ∈ R and all k ≥ 0.

Proof. First we note that, by [6, Lemma A.1], (Y_{k+1} | Y_k = η) is stochastically increasing in η for all k ≥ 0, i.e., (Y_{k+1} | Y_k = η) ≤_st (Y_{k+1} | Y_k = η′) for all η ≤ η′. Then, by [13, Lemma A.1], (X_{k+1} | X_k = η, F_k) ≤_st (Y_{k+1} | Y_k = η) for all η ∈ 𝒦 and all F_k, for all k ≥ 0.
(i) Trivially, X_0 ≤_st Y_0 since X_0 ≤ K = Y_0. Next, X_1 ≤_st Y_1 by [13, Lemma A.1]. We prove the general case by induction. Suppose X_k ≤_st Y_k for some k ≥ 1, and for this proof define the random variable

Y(θ) = max(θ − 1, 1) with probability q, and Y(θ) = K with probability 1 − q,

which is the conditional distribution of Y_{k+1} given Y_k = θ, viewed as a function of θ. By definition, Y_{k+1} has the same distribution as (Y(θ) | θ = Y_k).
Since the Y(θ) are stochastically increasing in θ by [13, Lemma A.1], we see that (Y(θ) | θ = Y_k) ≥_st (Y(θ) | θ = X_k) by [16, Theorem 1.A.6] and the induction hypothesis. Now, (Y(θ) | θ = X_k) ≥_st (X_{k+1} | X_k, F_k) by [16, Theorem 1.A.3(d)] and [13, Lemma A.1], for all histories F_k. It follows that Y_{k+1} ≥_st X_{k+1} by transitivity of ≥_st.
(ii) Follows from part (i) and the definition of ≥_st. ∎

By Theorem 6, if X_K ≤_st Y_K and we can make Pr(Y_K ≤ η) large, then we also obtain a meaningful bound on Pr(X_K ≤ η). Following this observation, the next two corollaries are the main mechanisms behind our general sample complexity results for EVI.

Corollary 7. For a given p ∈ [1, ∞), any ε > 0 and δ ∈ (0, 1), suppose Assumption 3 holds for this ε, and choose any K ≥ 1. Then for q ≥ (1/2 + (1 − δ)/2)^{1/K} and K′ ≥ log( 4 / (δ (1 − q) q^{K−1}) ), we have

||v_{K′} − v*||_{p,ρ} ≤ [ 2(1 − γ^{K+1}) / (1 − γ) ] [ C_{ρ,µ}^{1/p} ε + γ^{K/p} (2 v_max) ]

with probability at least 1 − δ.

Proof. Since {Y_k}_{k≥0} is an irreducible Markov chain on a finite state space, its steady-state distribution µ = (µ_i)_{i=1}^K on 𝒦 exists. By [13, Lemma 4.3], it is given by

µ_1 = q^{K−1}, µ_i = (1 − q) q^{K−i} for i = 2, ..., K−1, and µ_K = 1 − q.

The constant µ_min(q; K) := min{ q^{K−1}, (1 − q) q^{K−2}, 1 − q }, the minimum of the steady-state probabilities for q ∈ (0, 1) and K ≥ 1, appears shortly in the Markov chain mixing time bound for {Y_k}_{k≥0}. We note that µ(q; K) = (1 − q) q^{K−1} ≤ µ_min(q; K) is a simple lower bound for µ_min(q; K); this is the quantity µ(q; K) defined earlier.
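A quick numerical sanity check of this stationary distribution (a sketch with illustrative q and K; the chain is exactly the one constructed above):

```python
import numpy as np

def dominating_chain_stationary(q, K):
    """Transition matrix of Y_k on {1,...,K} and its stationary distribution."""
    P = np.zeros((K, K))
    for i in range(1, K + 1):
        P[i - 1, max(i - 1, 1) - 1] += q      # good iteration: move down (floor at 1)
        P[i - 1, K - 1] += 1.0 - q            # bad iteration: jump back to K
    # solve mu P = mu together with sum(mu) = 1
    A = np.vstack([P.T - np.eye(K), np.ones(K)])
    b = np.append(np.zeros(K), 1.0)
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu

q, K = 0.9, 5
mu = dominating_chain_stationary(q, K)
closed_form = np.array([q**(K - 1)] + [(1 - q) * q**(K - i) for i in range(2, K + 1)])
print(np.allclose(mu, closed_form))   # True
```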

Now, recall that ||µ − ν||_TV := (1/2) Σ_{η=1}^K |µ(η) − ν(η)| is the total variation distance between probability distributions on 𝒦. Let Q_k be the marginal distribution of Y_k for k ≥ 0. By a Markov chain mixing time argument (see, e.g., [17, Theorem 12.3]),

t_mix(δ̃) := min{ k ≥ 0 : ||Q_k − µ||_TV ≤ δ̃ } ≤ log( 1 / (δ̃ µ_min(q; K)) ) ≤ log( 1 / (δ̃ (1 − q) q^{K−1}) )

for any δ̃ ∈ (0, 1). So, for K′ ≥ log( 2 / (δ̃ (1 − q) q^{K−1}) ) we have

Pr(Y_{K′} = 1) ≥ µ_1 − ||Q_{K′} − µ||_TV ≥ q^{K−1} − δ̃,

where we use µ_1 = q^{K−1}. By Theorem 6, Pr(X_{K′} = 1) ≥ Pr(Y_{K′} = 1), and so Pr(X_{K′} = 1) ≥ q^{K−1} − δ̃ ≥ q^K − δ̃. Choose q and δ̃ to satisfy q^K ≥ 1/2 + (1 − δ)/2 and δ̃ = 1/2 − (1 − δ)/2 = δ/2, so that Pr(X_{K′} = 1) ≥ 1 − δ, and the desired result follows. ∎

The next result uses the same reasoning for the supremum norm case.

Corollary 8. Given any ε > 0 and δ ∈ (0, 1), suppose Assumption 3 holds for this ε, and choose any K ≥ 1. For q ≥ (1/2 + (1 − δ)/2)^{1/K} and K′ ≥ log( 4 / (δ (1 − q) q^{K−1}) ), we have

Pr( ||v_{K′} − v*||_∞ ≤ 2ε/(1 − γ) + 2 γ^K v_max ) ≥ 1 − δ.

The sample complexity results for both EVI algorithms in Section III follow from Corollaries 7 and 8, as we show next.

B. Proofs of Theorems 1, 2 and 3

We now apply the random operator framework to both EVI algorithms. It is easy to check the conditions of Corollaries 7 and 8, from which the specific sample complexity results follow. We first prove Theorem 1 using the Markov chain technique. Based on Theorem 16 in the appendix, we let p_1(N, M, J, ε) denote the lower bound on the probability of the event { ||T̂ v − T v||_{1,µ} ≤ ε } for ε > 0. We also note that d_{1,µ}(T v, F(Θ)) ≤ d_{1,µ}(T F(Θ), F(Θ)) for all v ∈ F(Θ).

Proof (of Theorem 1). Starting from inequality (7) with p = 1 and using Theorem 16, we have

||v_{K′} − v*||_{1,ρ} ≤ 2 C_{ρ,µ} [ d_{1,µ}(T F(Θ), F(Θ)) + ε/2 ] + 2 v_max γ^K C_{ρ,µ}

whenever ||ε_k||_{1,µ} ≤ d_{1,µ}(T F(Θ), F(Θ)) + ε/2 for all k = 0, 1, ..., K−1. Choosing K_1 = ⌈ ln(4 C_{ρ,µ} v_max / ε) / ln(1/γ) ⌉ makes the last term at most ε/2, and the bound in the statement follows. Based on Corollary 7, we then just need to choose N, M, J such that p_1(N, M, J, ε) ≥ 1 − δ_1 = (1 − δ/2)^{1/K_1}, and apply Theorem 16 with probability 1 − δ_1. ∎

The proof of Theorem 2 follows along similar lines, based on Theorem 17. Here we let p_2(N, M, J, ε) denote a lower bound on the probability of the event { ||T̂ v − T v||_{2,µ} ≤ ε }.

Proof (of Theorem 2). Starting from inequality (7) with p = 2 and using Theorem 17, we have

||v_{K′} − v*||_{2,ρ} ≤ (2γ)^{1/2} C_{ρ,µ}^{1/2} [ d_{2,µ}(T F(Θ), F(Θ)) + ε/2 ] + 4 v_max γ^{K/2} / (1 − γ)

whenever ||ε_k||_{2,µ} ≤ d_{2,µ}(T F(Θ), F(Θ)) + ε/2 for all k = 0, 1, ..., K−1. We choose K so that the last term is at most ε/2, which gives K_2 = ⌈ ln(4 C_{ρ,µ}^{1/2} v_max / ε) / ln(1/γ) ⌉. Based on Corollary 7, we then choose N, M, J such that p_2(N, M, J, ε) ≥ 1 − δ_1 = (1 − δ/2)^{1/K_2}, and apply Theorem 17 with probability 1 − δ_1. ∎

We now prove the result for L_∞ function fitting in the RKHS, based on Theorem 19 in the appendix. For this proof, we let p_3(N, M, ε) denote a lower bound on the probability of the event { ||T̂ v − T v||_∞ ≤ ε }.

Proof (of Theorem 3). By inequality (8), we choose ε′ and K such that 2ε′/(1 − γ) ≤ ε/2 and 2 γ^K v_max ≤ ε/2, by setting K_3 = ⌈ ln(4 v_max / ε) / ln(1/γ) ⌉. Based on Corollary 8, we next choose N and M such that p_3(N, M, ε′) ≥ 1 − δ_1 = (1 − δ/2)^{1/K_3}. We then apply Theorem 19 with error ε′ = ε(1 − γ)/4 and probability 1 − δ_1. ∎

V. NUMERICAL EXPERIMENTS

We now present the numerical performance of our algorithms by testing them on the benchmark optimal replacement problem [7], [5]. The setting is that a product (such as a car) becomes more costly to maintain with time/miles, and must be replaced at some point. Here, the state s_t ∈ R_+ represents the accumulated utilization of the product, so s_t = 0 denotes a brand new durable good. The action space is A = {0, 1}: at each time step t we can either replace the product (a_t = 0) or keep it (a_t = 1). Replacement incurs a cost C, while keeping the product incurs a maintenance cost c(s_t). The transition density is

q(s_{t+1} | s_t, a_t) = λ e^{−λ(s_{t+1} − s_t)} if s_{t+1} ≥ s_t and a_t = 1; λ e^{−λ s_{t+1}} if s_{t+1} ≥ 0 and a_t = 0; and 0 otherwise,

and the reward function is

r(s_t, a_t) = −c(s_t) if a_t = 1, and r(s_t, a_t) = −C − c(0) if a_t = 0.

For our computations we use γ = 0.6, λ = 0.5, C = 30 and c(s) = 4s. The optimal value function and the optimal policy can be computed analytically for this problem.
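Sampling from this model is straightforward; a minimal sketch of a simulator consistent with the transition density and cost convention above (names and structure are ours):

```python
import numpy as np

LAM, C_REPLACE, GAMMA = 0.5, 30.0, 0.6
maintenance = lambda s: 4.0 * s

def step(s, a, rng):
    """One transition of the optimal replacement MDP.

    a = 1: keep (utilization grows by an Exponential(LAM) increment),
    a = 0: replace (utilization restarts from an Exponential(LAM) draw).
    Returns (next_state, reward).
    """
    if a == 1:
        s_next = s + rng.exponential(1.0 / LAM)
        reward = -maintenance(s)
    else:
        s_next = rng.exponential(1.0 / LAM)
        reward = -C_REPLACE - maintenance(0.0)
    return s_next, reward
```

A simulator of this form is all that EVI needs: it is queried M times per sampled state-action pair in Step 2 of Algorithms 1 and 2.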

[Fig. 1. Relative error versus iterations for the various algorithms.]

For EVI+RPBF, we use J random parameterized Fourier features φ(s; θ_j) = cos(θ_j^T s + b_j), j = 1, ..., J, with the θ_j drawn i.i.d. from a zero-mean Gaussian with small variance and b_j ∼ Unif[−π, π]; we fix J = 5. For EVI+RKHS, we use the Gaussian kernel k(x, y) = exp(−||x − y||²/σ²) with a small bandwidth and L_2 regularization with a fixed regularization coefficient. The underlying function space for FVI is the set of polynomials of degree 4.

The relative error per iteration for the different algorithms, with M = 5 next-state samples per sampled state-action pair, is shown in Fig. 1: the Y-axis shows the relative error sup_s |v*(s) − v_k(s)| / v*(s) and the X-axis the iteration index k. EVI+RPBF quickly drives the relative error down to a few percent; FVI is close behind, while EVI+RKHS is slower, though it may improve with a larger M. This is also reflected in the actual runtimes: EVI+RPBF and FVI take comparable wall-clock time to reach the same relative error, while EVI+RKHS takes considerably longer. Note that the performance of FVI depends on being able to choose suitable basis functions, which is easy for the optimal replacement problem; for other problems, we may expect both EVI algorithms to perform better.

We therefore also tested the algorithms on the cart-pole balancing problem, another benchmark but one for which the optimal value function is unknown. We formulate it as an MDP with a continuous 2-dimensional state space and 2 actions. The state comprises the position of the cart and the angle of the pole. The actions add a force of −10 N or +10 N to the cart, pushing it left or right, and we add ±50% noise to these actions. The system dynamics are given in [3]. Rewards are zero except at the failure state: if the position of the cart goes beyond ±2.4 or the pole exceeds an angle of ±12 degrees, the reward is −1. In the RPBF case we consider parameterized Fourier features of the form cos(w^T s + b) with w ∼ N(0, I) and b ∼ Unif[−π, π], with a moderately larger J than before; in the RKHS case we consider a Gaussian kernel K(s, s′) = exp(−||s − s′||²/(2σ²)) with a small bandwidth σ. We limit each episode to 1000 time steps and compute the average length of the episode for which we are able to balance the pole without hitting the failure state; this is the "Goal" column in Table I. The other columns show the runtime needed by each algorithm to learn to achieve that goal.

[Table I. Runtime (in minutes) needed by EVI+RPBF, FVI and EVI+RKHS to reach a given goal (average balancing time) on the cart-pole problem.]

From the table we can see that EVI+RPBF outperforms FVI and EVI+RKHS. Note that guarantees for FVI are available only for the L_2 error, and for EVI+RPBF for the L_p error; EVI+RKHS is the only algorithm that provides guarantees on the sup-norm error. Also note that for problems in which the value function is not so regular and good basis functions are difficult to guess, the EVI+RKHS method is likely to perform better, but as of now we do not have a numerical example to demonstrate this.
VI. CONCLUSION

In this paper, we have introduced universally applicable approximate dynamic programming algorithms for continuous state space MDPs. The algorithms are based on using randomization to break the curse of dimensionality via a synthesis of random function approximation and empirical approaches. Our first algorithm is based on random parametric function fitting, obtained by sampling parameters in each iteration. The second is based on sampling states, which then yield a set of basis functions in an RKHS through the kernel. Both function fitting steps involve convex optimization problems and can be implemented with standard packages. Both algorithms can be viewed as iteration of a type of random Bellman operator followed by a random projection operator. Such iterations can be quite difficult to analyze directly, but viewed as random operators, following [6], we can construct Markov chains that stochastically dominate the error sequences and are easy to analyze. These yield not only convergence but also non-asymptotic sample complexity bounds. Numerical experiments, including the cart-pole balancing problem, suggest good performance in practice. More rigorous numerical analysis will be conducted as part of future work.

REFERENCES

[1] D. P. Bertsekas, Dynamic Programming and Optimal Control, 3rd ed., vol. II. Belmont, MA: Athena Scientific, 2012.
[2] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality. John Wiley & Sons, 2007.
[3] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[4] R. A. DeVore, "Nonlinear approximation," Acta Numerica, vol. 7, pp. 51-150, 1998.
[5] R. Munos and C. Szepesvári, "Finite-time bounds for fitted value iteration," Journal of Machine Learning Research, vol. 9, pp. 815-857, 2008.
[6] W. B. Haskell, R. Jain, and D. Kalathil, "Empirical dynamic programming," Mathematics of Operations Research, vol. 41, no. 2, pp. 402-429, 2016.
[7] J. Rust, "Using randomization to break the curse of dimensionality," Econometrica, vol. 65, no. 3, pp. 487-516, 1997.
[8] D. P. de Farias and B. Van Roy, "On constraint sampling in the linear programming approach to approximate dynamic programming," Mathematics of Operations Research, vol. 29, no. 3, pp. 462-478, 2004.
[9] D. Ormoneit and Ś. Sen, "Kernel-based reinforcement learning," Machine Learning, vol. 49, no. 2-3, pp. 161-178, 2002.
[10] S. Grünewälder, G. Lever, L. Baldassarre, M. Pontil, and A. Gretton, "Modelling transition dynamics in MDPs with RKHS embeddings," arXiv preprint, 2012.
[11] A. Rahimi and B. Recht, "Uniform approximation of functions with random bases," in Proc. 46th Annual Allerton Conference on Communication, Control, and Computing. IEEE, 2008.
[12] A. Rahimi and B. Recht, "Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning," in Advances in Neural Information Processing Systems, 2009.
[13] W. B. Haskell, R. Jain, and D. Kalathil, "Empirical dynamic programming," to appear in Mathematics of Operations Research, 2015.
[14] S. Smale and D.-X. Zhou, "Shannon sampling II: Connections to learning theory," Applied and Computational Harmonic Analysis, vol. 19, no. 3, pp. 285-302, 2005.
[15] R. Munos, "Performance bounds in L_p-norm for approximate value iteration," SIAM Journal on Control and Optimization, vol. 46, no. 2, pp. 541-561, 2007.
[16] M. Shaked and J. G. Shanthikumar, Stochastic Orders. Springer, 2007.
[17] D. A. Levin, Y. Peres, and E. L. Wilmer, Markov Chains and Mixing Times. American Mathematical Society, 2009.
[18] M. Anthony and P. L. Bartlett, Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.
[19] D. Haussler, "Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik-Chervonenkis dimension," Journal of Combinatorial Theory, Series A, vol. 69, no. 2, pp. 217-232, 1995.

APPENDIX

A. Supplement for Section III

The following computation shows that T maps bounded functions to Lipschitz continuous functions when Q and c are both Lipschitz continuous in the sense of (1) and (2). Suppose ||v||_∞ ≤ v_max; then T v is Lipschitz continuous with constant L_c + γ v_max L_Q. Indeed,

|T v(s) − T v(s′)| ≤ max_{a∈A} |c(s, a) − c(s′, a)| + γ max_{a∈A} | ∫ v(y) Q(dy|s, a) − ∫ v(y) Q(dy|s′, a) |
≤ L_c ||s − s′|| + γ v_max max_{a∈A} ||Q(·|s, a) − Q(·|s′, a)||_TV
≤ (L_c + γ v_max L_Q) ||s − s′||.

B. Supplement for Section IV

First, we adapt [5, Lemma 3] to obtain point-wise error bounds on v_K − v* in terms of the errors {ε_k}_{k≥0}. These bounds are especially useful when analyzing the performance of EVI with respect to norms other than the supremum norm, since T does not have a contraction property with respect to any other norm. For any π ∈ Π, we define the operator Q^π : F → F, which gives the transition mapping as a function of π, via (Q^π v)(s) := ∫ v(y) Q(dy | s, π(s)), s ∈ S. Then we define the operator T_π : F → F via (T_π v)(s) := c(s, π(s)) + γ ∫ v(x) Q(dx | s, π(s)), s ∈ S. For later use, we let π* ∈ Π be an optimal policy satisfying π*(s) ∈ argmin_{a∈A} { c(s, a) + γ ∫ v*(x) Q(dx | s, a) }, s ∈ S, i.e., it is greedy with respect to v*.
More generally, a policy π ∈ Π is greedy with respect to v ∈ F if T_π v = T v. Throughout this section, we let π_k be a greedy policy with respect to v_k, so that T_{π_k} v_k = T v_k for all k ≥ 0. For fixed K, we define the operators

A_K := (1/2) [ (Q^{π*})^K + Q^{π_{K−1}} Q^{π_{K−2}} ··· Q^{π_0} ],
A_k := (1/2) [ (Q^{π*})^{K−1−k} + Q^{π_{K−1}} Q^{π_{K−2}} ··· Q^{π_{k+1}} ], k = 0, ..., K−1,

formed by composition of transition kernels. We let 1 denote the constant function equal to one on S, and define the constant γ̄ := (1 − γ^{K+1})/(1 − γ) for use shortly. We note that the A_k are all positive linear operators and A_k 1 = 1 for all k = 0, ..., K.

Lemma 9. For any K ≥ 1,
(i) v_K − v* ≤ Σ_{k=0}^{K−1} γ^{K−1−k} (Q^{π*})^{K−1−k} ε_k + γ^K (Q^{π*})^K (v_0 − v*);
(ii) v_K − v* ≥ Σ_{k=0}^{K−1} γ^{K−1−k} Q^{π_{K−1}} Q^{π_{K−2}} ··· Q^{π_{k+1}} ε_k + γ^K Q^{π_{K−1}} Q^{π_{K−2}} ··· Q^{π_0} (v_0 − v*);
(iii) |v_K − v*| ≤ 2 [ Σ_{k=0}^{K−1} γ^{K−1−k} A_k |ε_k| + γ^K A_K (2 v_max) 1 ].

Proof. (i) For any k, we have T v_k ≤ T_{π*} v_k and T_{π*} v_k − T_{π*} v* = γ Q^{π*} (v_k − v*), so

v_{k+1} − v* = T v_k + ε_k − T v* ≤ T_{π*} v_k + ε_k − T_{π*} v* = γ Q^{π*} (v_k − v*) + ε_k.

The result then follows by induction. (ii) Similarly, for any k, we have T v* ≤ T_{π_k} v* and T v_k − T_{π_k} v* = T_{π_k} v_k − T_{π_k} v* = γ Q^{π_k} (v_k − v*), so

v_{k+1} − v* = T v_k + ε_k − T v* ≥ T_{π_k} v_k + ε_k − T_{π_k} v* = γ Q^{π_k} (v_k − v*) + ε_k.

Again, the result follows by induction. (iii) If f ≤ g ≤ h in F, then |g| ≤ |f| + |h|, so combining parts (i) and (ii) gives

|v_K − v*| ≤ 2 [ Σ_{k=0}^{K−1} γ^{K−1−k} A_k |ε_k| + γ^K A_K |v_0 − v*| ].

Then we note that |v_0 − v*| ≤ 2 v_max 1, which gives (iii). ∎

Now we use Lemma 9 to derive the p-norm bounds.

Proof (of Lemma 4). Using Σ_{k=0}^{K−1} γ^{K−1−k} + γ^K = (1 − γ^{K+1})/(1 − γ), define the constants

α_k := (1 − γ) γ^{K−1−k} / (1 − γ^{K+1}), k = 0, ..., K−1, and α_K := (1 − γ) γ^K / (1 − γ^{K+1}),

and note that Σ_{k=0}^K α_k = 1. From Lemma 9(iii) we then obtain

|v_K − v*| ≤ [ 2(1 − γ^{K+1}) / (1 − γ) ] [ Σ_{k=0}^{K−1} α_k A_k |ε_k| + α_K A_K (2 v_max) 1 ].

Next, we compute

||v_K − v*||^p_{p,ρ} = ∫ |v_K(s) − v*(s)|^p ρ(ds)
≤ [ 2(1 − γ^{K+1}) / (1 − γ) ]^p ∫ [ Σ_{k=0}^{K−1} α_k (A_k |ε_k|^p)(s) + α_K (2 v_max)^p ] ρ(ds),

using Jensen's inequality and the convexity of x ↦ x^p. Now, ρ A_k ≤ c_{ρ,µ}(K−k) µ for k = 0, ..., K by Assumption 1(ii), and so, for all k = 0, ..., K−1,

∫ (A_k |ε_k|^p)(s) ρ(ds) ≤ c_{ρ,µ}(K−k) ||ε_k||^p_{p,µ}.

We arrive at

||v_K − v*||^p_{p,ρ} ≤ [ 2(1 − γ^{K+1}) / (1 − γ) ]^p [ Σ_{k=0}^{K−1} α_k c_{ρ,µ}(K−k) ||ε_k||^p_{p,µ} + α_K (2 v_max)^p ],

which requires Assumption 1(ii). Now, by the subadditivity of x ↦ x^t for t = 1/p ∈ (0, 1] with p ∈ [1, ∞), the assumption that ||ε_k||_{p,µ} ≤ ε for all k = 0, 1, ..., K−1, and since Σ_{k=0}^{K−1} α_k c_{ρ,µ}(K−k) ≤ [(1 − γ)/(1 − γ^{K+1})] C_{ρ,µ} by Assumption 1(ii), we see that

||v_K − v*||_{p,ρ} ≤ [ 2(1 − γ^{K+1}) / (1 − γ) ] [ C_{ρ,µ}^{1/p} ε + γ^{K/p} (2 v_max) ],

which gives the desired result. ∎

Supremum norm error bounds follow more easily from Lemma 9.

Proof (of Lemma 5). By Lemma 9 we have

v_K − v* ≤ Σ_{k=0}^{K−1} γ^{K−1−k} (Q^{π*})^{K−1−k} ε_k + γ^K (Q^{π*})^K (v_0 − v*),
v_K − v* ≥ Σ_{k=0}^{K−1} γ^{K−1−k} Q^{π_{K−1}} ··· Q^{π_{k+1}} ε_k + γ^K Q^{π_{K−1}} ··· Q^{π_0} (v_0 − v*).

Taking supremum norms of both bounds and using the triangle inequality, the fact that ||Q f||_∞ = || ∫ f(y) Q(dy | ·) ||_∞ ≤ ||f||_∞ for any transition kernel Q on S and f ∈ F, and ||v_0 − v*||_∞ ≤ 2 v_max, we obtain, for any K ≥ 1,

||v_K − v*||_∞ ≤ 2 Σ_{k=0}^{K−1} γ^{K−1−k} ||ε_k||_∞ + 2 γ^K v_max.   (9)

The claim (8) follows immediately since Σ_{k=0}^{K−1} γ^{K−1−k} ≤ 1/(1 − γ). ∎

We emphasize that Lemma 5 does not require any assumptions on the transition probabilities, in contrast to Lemma 4.

C. Supplement for Section IV: Function approximation

We record here several pertinent results on the type of function reconstruction used in our EVI algorithms. The first lemma is illustrative of approximation results in Hilbert spaces; it gives an O(1/√J) convergence rate, in probability, on the error from using F(θ^{1:J}) compared to F(Θ) in L_2(S, µ).

Lemma 10 ([12]). Fix f* ∈ F(Θ). For any δ ∈ (0, 1), there exists a function f̂ ∈ F(θ^{1:J}) such that

||f* − f̂||_{2,µ} ≤ (C/√J) ( 1 + √(2 log(1/δ)) )

with probability at least 1 − δ.

The next result is an easy consequence of [12] and bounds the error from using F(θ^{1:J}) compared to F(Θ) in L_1(S, µ).

12 Lemma. Fix f F Θ, for any 0, there exists a function ˆf F θ :J such that ˆf f, µ C J + with probability at least. log Proof. Choose f, g F, then by Jensen s inequality we have f g, µ = E µ f g = E µ f g / E µ f g. The desired result then follows by, Lemma. Now we consider function approximation in the supremum norm. Recall the definition of the regression function f M s = E min c s, a + γ M v Xm s, a, M m= s. Then we have the following approximation result, for which we recall the constant κ := sup s K s, s. Corollary. 4, Corollary 5 For any 0,, f z, λ f M HK C log 4/ /6 log 4/ /3 κ for λ =, N N with probability at least. Proof. Uses the fact that for any f H K, f κ f HK. For any s, we have f s = K s,, f HK and subsequently K s,, f HK K s, HK f HK = K s,, K s, HK f HK = K s, s f HK sup K s, s f HK, s where the first inequality is by Cauchy-chwartz and the second is by assumption that K is a bounded kernel. The preceding result is about the error when approximating the regression function f M, but f M generally is not equal to T v. We bound the error between f M and T v as well in the next subsection. D. Bellman error The layout of this subsection is modeled after the arguments in 5, but with the added consideration of randomized function fitting. We use the following easy-to-establish fact. Fact 3. Let X be a given set, and f : X R and f : X R be two real-valued functions on X. Then, i inf x X f x inf x X f x sup x X f x f x, and ii sup x X f x sup x X f x sup x X f x f x. For example, Fact 3 can be used to show that T is contractive in the supremum norm. The next result is about T, it uses Hoeffding s inequality to bound the estimation error between ṽ s n N n= and T v s n N n= in probability. Lemma 4. For any p,, f, v F ; v max, and > 0, Pr f T v p, ˆµ f ṽ p, ˆµ > M N A exp. v max Proof. First we have f T v p, ˆµ f ṽ p, ˆµ T v ṽ p, ˆµ by the reverse triangle inequality. Then, for any s we have T v s ṽ s = max γ max c s, a + γ c s, a + γ M M m= v x Q dx s, a M v x Q dx s, a v X s, a m M v Xm s, a m= by Fact 3. We may also take v s 0, v max for all s by assumption on the cost function, so by the Hoeffding inequality and the union bound we obtain Pr max T v s n ṽ s n n=,..., N M N A exp and thus v max Pr T v ṽ p, ˆµ N = Pr N n= T v s n ṽ s n p /p Pr max n=,..., N T v s n ṽ s n, which gives the desired result. To continue, we introduce the following additional notation corresponding to a set of functions F F : F s :N f s,..., f s N : f F; N, F s :N is the covering number of F s :N with respect to the norm on R N. The next lemma uniformly bounds the estimation error between the true expectation and the empirical expectation over the set F θ :J in the following statement, e is Euler s number. Lemma 5. For any > 0 and N, N Pr sup f n E µ f N > f Fθ :J n= J e vmax N 8 e J + exp 8 vmax. Proof. For any F F ; v max, > 0, and N, we have


More information

Approximate Dynamic Programming

Approximate Dynamic Programming Approximate Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) Ecole Centrale - Option DAD SequeL INRIA Lille EC-RL Course Value Iteration: the Idea 1. Let V 0 be any vector in R N A. LAZARIC Reinforcement

More information

Brownian Motion. 1 Definition Brownian Motion Wiener measure... 3

Brownian Motion. 1 Definition Brownian Motion Wiener measure... 3 Brownian Motion Contents 1 Definition 2 1.1 Brownian Motion................................. 2 1.2 Wiener measure.................................. 3 2 Construction 4 2.1 Gaussian process.................................

More information

Chapter 3: The Reinforcement Learning Problem

Chapter 3: The Reinforcement Learning Problem Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)

More information

Rademacher Complexity Bounds for Non-I.I.D. Processes

Rademacher Complexity Bounds for Non-I.I.D. Processes Rademacher Complexity Bounds for Non-I.I.D. Processes Mehryar Mohri Courant Institute of Mathematical ciences and Google Research 5 Mercer treet New York, NY 00 mohri@cims.nyu.edu Afshin Rostamizadeh Department

More information

2014:05 Incremental Greedy Algorithm and its Applications in Numerical Integration. V. Temlyakov

2014:05 Incremental Greedy Algorithm and its Applications in Numerical Integration. V. Temlyakov INTERDISCIPLINARY MATHEMATICS INSTITUTE 2014:05 Incremental Greedy Algorithm and its Applications in Numerical Integration V. Temlyakov IMI PREPRINT SERIES COLLEGE OF ARTS AND SCIENCES UNIVERSITY OF SOUTH

More information

Jun Ma and Warren B. Powell Department of Operations Research and Financial Engineering Princeton University, Princeton, NJ 08544

Jun Ma and Warren B. Powell Department of Operations Research and Financial Engineering Princeton University, Princeton, NJ 08544 Convergence Analysis of Kernel-based On-policy Approximate Policy Iteration Algorithms for Markov Decision Processes with Continuous, Multidimensional States and Actions Jun Ma and Warren B. Powell Department

More information

Linearly-solvable Markov decision problems

Linearly-solvable Markov decision problems Advances in Neural Information Processing Systems 2 Linearly-solvable Markov decision problems Emanuel Todorov Department of Cognitive Science University of California San Diego todorov@cogsci.ucsd.edu

More information

Approximate Dynamic Programming By Minimizing Distributionally Robust Bounds

Approximate Dynamic Programming By Minimizing Distributionally Robust Bounds Approximate Dynamic Programming By Minimizing Distributionally Robust Bounds Marek Petrik IBM T.J. Watson Research Center, Yorktown, NY, USA Abstract Approximate dynamic programming is a popular method

More information

Reinforcement Learning and NLP

Reinforcement Learning and NLP 1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value

More information

= µ(s, a)c(s, a) s.t. linear constraints ;

= µ(s, a)c(s, a) s.t. linear constraints ; A CONVE ANALYTIC APPROACH TO RISK-AWARE MARKOV DECISION PROCESSES WILLIAM B. HASKELL AND RAHUL JAIN Abstract. In classical Markov decision process (MDP) theory, we search for a policy that say, minimizes

More information

arxiv: v1 [math.oc] 9 Oct 2018

arxiv: v1 [math.oc] 9 Oct 2018 A Convex Optimization Approach to Dynamic Programming in Continuous State and Action Spaces Insoon Yang arxiv:1810.03847v1 [math.oc] 9 Oct 2018 Abstract A convex optimization-based method is proposed to

More information

Empirical Risk Minimization

Empirical Risk Minimization Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space

More information

1 Problem Formulation

1 Problem Formulation Book Review Self-Learning Control of Finite Markov Chains by A. S. Poznyak, K. Najim, and E. Gómez-Ramírez Review by Benjamin Van Roy This book presents a collection of work on algorithms for learning

More information

WE consider finite-state Markov decision processes

WE consider finite-state Markov decision processes IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 54, NO. 7, JULY 2009 1515 Convergence Results for Some Temporal Difference Methods Based on Least Squares Huizhen Yu and Dimitri P. Bertsekas Abstract We consider

More information

Approximation Theoretical Questions for SVMs

Approximation Theoretical Questions for SVMs Ingo Steinwart LA-UR 07-7056 October 20, 2007 Statistical Learning Theory: an Overview Support Vector Machines Informal Description of the Learning Goal X space of input samples Y space of labels, usually

More information

Introduction to Reinforcement Learning. CMPT 882 Mar. 18

Introduction to Reinforcement Learning. CMPT 882 Mar. 18 Introduction to Reinforcement Learning CMPT 882 Mar. 18 Outline for the week Basic ideas in RL Value functions and value iteration Policy evaluation and policy improvement Model-free RL Monte-Carlo and

More information

1 Stochastic Dynamic Programming

1 Stochastic Dynamic Programming 1 Stochastic Dynamic Programming Formally, a stochastic dynamic program has the same components as a deterministic one; the only modification is to the state transition equation. When events in the future

More information

Lecture 4 Lebesgue spaces and inequalities

Lecture 4 Lebesgue spaces and inequalities Lecture 4: Lebesgue spaces and inequalities 1 of 10 Course: Theory of Probability I Term: Fall 2013 Instructor: Gordan Zitkovic Lecture 4 Lebesgue spaces and inequalities Lebesgue spaces We have seen how

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces 9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we

More information

1 MDP Value Iteration Algorithm

1 MDP Value Iteration Algorithm CS 0. - Active Learning Problem Set Handed out: 4 Jan 009 Due: 9 Jan 009 MDP Value Iteration Algorithm. Implement the value iteration algorithm given in the lecture. That is, solve Bellman s equation using

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Markov decision process & Dynamic programming Evaluative feedback, value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value

More information

Value Iteration and Action ɛ-approximation of Optimal Policies in Discounted Markov Decision Processes

Value Iteration and Action ɛ-approximation of Optimal Policies in Discounted Markov Decision Processes Value Iteration and Action ɛ-approximation of Optimal Policies in Discounted Markov Decision Processes RAÚL MONTES-DE-OCA Departamento de Matemáticas Universidad Autónoma Metropolitana-Iztapalapa San Rafael

More information

Lecture notes for Analysis of Algorithms : Markov decision processes

Lecture notes for Analysis of Algorithms : Markov decision processes Lecture notes for Analysis of Algorithms : Markov decision processes Lecturer: Thomas Dueholm Hansen June 6, 013 Abstract We give an introduction to infinite-horizon Markov decision processes (MDPs) with

More information

Markov Decision Processes and Dynamic Programming

Markov Decision Processes and Dynamic Programming Markov Decision Processes and Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) ENS Cachan - Master 2 MVA SequeL INRIA Lille MVA-RL Course How to model an RL problem The Markov Decision Process

More information

Nearest Neighbor. Machine Learning CSE546 Kevin Jamieson University of Washington. October 26, Kevin Jamieson 2

Nearest Neighbor. Machine Learning CSE546 Kevin Jamieson University of Washington. October 26, Kevin Jamieson 2 Nearest Neighbor Machine Learning CSE546 Kevin Jamieson University of Washington October 26, 2017 2017 Kevin Jamieson 2 Some data, Bayes Classifier Training data: True label: +1 True label: -1 Optimal

More information

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Mostafa D. Awheda Department of Systems and Computer Engineering Carleton University Ottawa, Canada KS 5B6 Email: mawheda@sce.carleton.ca

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it

More information

Technical Appendix for: When Promotions Meet Operations: Cross-Selling and Its Effect on Call-Center Performance

Technical Appendix for: When Promotions Meet Operations: Cross-Selling and Its Effect on Call-Center Performance Technical Appendix for: When Promotions Meet Operations: Cross-Selling and Its Effect on Call-Center Performance In this technical appendix we provide proofs for the various results stated in the manuscript

More information

Some Background Math Notes on Limsups, Sets, and Convexity

Some Background Math Notes on Limsups, Sets, and Convexity EE599 STOCHASTIC NETWORK OPTIMIZATION, MICHAEL J. NEELY, FALL 2008 1 Some Background Math Notes on Limsups, Sets, and Convexity I. LIMITS Let f(t) be a real valued function of time. Suppose f(t) converges

More information

ilstd: Eligibility Traces and Convergence Analysis

ilstd: Eligibility Traces and Convergence Analysis ilstd: Eligibility Traces and Convergence Analysis Alborz Geramifard Michael Bowling Martin Zinkevich Richard S. Sutton Department of Computing Science University of Alberta Edmonton, Alberta {alborz,bowling,maz,sutton}@cs.ualberta.ca

More information

Q-Learning in Continuous State Action Spaces

Q-Learning in Continuous State Action Spaces Q-Learning in Continuous State Action Spaces Alex Irpan alexirpan@berkeley.edu December 5, 2015 Contents 1 Introduction 1 2 Background 1 3 Q-Learning 2 4 Q-Learning In Continuous Spaces 4 5 Experimental

More information

Real Time Value Iteration and the State-Action Value Function

Real Time Value Iteration and the State-Action Value Function MS&E338 Reinforcement Learning Lecture 3-4/9/18 Real Time Value Iteration and the State-Action Value Function Lecturer: Ben Van Roy Scribe: Apoorva Sharma and Tong Mu 1 Review Last time we left off discussing

More information

Online gradient descent learning algorithm

Online gradient descent learning algorithm Online gradient descent learning algorithm Yiming Ying and Massimiliano Pontil Department of Computer Science, University College London Gower Street, London, WCE 6BT, England, UK {y.ying, m.pontil}@cs.ucl.ac.uk

More information

Linear Programming Formulation for Non-stationary, Finite-Horizon Markov Decision Process Models

Linear Programming Formulation for Non-stationary, Finite-Horizon Markov Decision Process Models Linear Programming Formulation for Non-stationary, Finite-Horizon Markov Decision Process Models Arnab Bhattacharya 1 and Jeffrey P. Kharoufeh 2 Department of Industrial Engineering University of Pittsburgh

More information

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.

More information

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation CS234: RL Emma Brunskill Winter 2018 Material builds on structure from David SIlver s Lecture 4: Model-Free

More information

Math 328 Course Notes

Math 328 Course Notes Math 328 Course Notes Ian Robertson March 3, 2006 3 Properties of C[0, 1]: Sup-norm and Completeness In this chapter we are going to examine the vector space of all continuous functions defined on the

More information

Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms *

Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms * Proceedings of the 8 th International Conference on Applied Informatics Eger, Hungary, January 27 30, 2010. Vol. 1. pp. 87 94. Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms

More information

Value and Policy Iteration

Value and Policy Iteration Chapter 7 Value and Policy Iteration 1 For infinite horizon problems, we need to replace our basic computational tool, the DP algorithm, which we used to compute the optimal cost and policy for finite

More information

Stochastic optimization in Hilbert spaces

Stochastic optimization in Hilbert spaces Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert

More information

Chapter 3: The Reinforcement Learning Problem

Chapter 3: The Reinforcement Learning Problem Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information

Lecture 5. If we interpret the index n 0 as time, then a Markov chain simply requires that the future depends only on the present and not on the past.

Lecture 5. If we interpret the index n 0 as time, then a Markov chain simply requires that the future depends only on the present and not on the past. 1 Markov chain: definition Lecture 5 Definition 1.1 Markov chain] A sequence of random variables (X n ) n 0 taking values in a measurable state space (S, S) is called a (discrete time) Markov chain, if

More information

MATH4406 Assignment 5

MATH4406 Assignment 5 MATH4406 Assignment 5 Patrick Laub (ID: 42051392) October 7, 2014 1 The machine replacement model 1.1 Real-world motivation Consider the machine to be the entire world. Over time the creator has running

More information

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory Part V 7 Introduction: What are measures and why measurable sets Lebesgue Integration Theory Definition 7. (Preliminary). A measure on a set is a function :2 [ ] such that. () = 2. If { } = is a finite

More information

STA 711: Probability & Measure Theory Robert L. Wolpert

STA 711: Probability & Measure Theory Robert L. Wolpert STA 711: Probability & Measure Theory Robert L. Wolpert 6 Independence 6.1 Independent Events A collection of events {A i } F in a probability space (Ω,F,P) is called independent if P[ i I A i ] = P[A

More information

Notes on Tabular Methods

Notes on Tabular Methods Notes on Tabular ethods Nan Jiang September 28, 208 Overview of the methods. Tabular certainty-equivalence Certainty-equivalence is a model-based RL algorithm, that is, it first estimates an DP model from

More information

Ergodic Theorems. Samy Tindel. Purdue University. Probability Theory 2 - MA 539. Taken from Probability: Theory and examples by R.

Ergodic Theorems. Samy Tindel. Purdue University. Probability Theory 2 - MA 539. Taken from Probability: Theory and examples by R. Ergodic Theorems Samy Tindel Purdue University Probability Theory 2 - MA 539 Taken from Probability: Theory and examples by R. Durrett Samy T. Ergodic theorems Probability Theory 1 / 92 Outline 1 Definitions

More information

Open Theoretical Questions in Reinforcement Learning

Open Theoretical Questions in Reinforcement Learning Open Theoretical Questions in Reinforcement Learning Richard S. Sutton AT&T Labs, Florham Park, NJ 07932, USA, sutton@research.att.com, www.cs.umass.edu/~rich Reinforcement learning (RL) concerns the problem

More information

CS 287: Advanced Robotics Fall Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study

CS 287: Advanced Robotics Fall Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study CS 287: Advanced Robotics Fall 2009 Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study Pieter Abbeel UC Berkeley EECS Assignment #1 Roll-out: nice example paper: X.

More information

Generalization bounds

Generalization bounds Advanced Course in Machine Learning pring 200 Generalization bounds Handouts are jointly prepared by hie Mannor and hai halev-hwartz he problem of characterizing learnability is the most basic question

More information

6 Reinforcement Learning

6 Reinforcement Learning 6 Reinforcement Learning As discussed above, a basic form of supervised learning is function approximation, relating input vectors to output vectors, or, more generally, finding density functions p(y,

More information

3 Integration and Expectation

3 Integration and Expectation 3 Integration and Expectation 3.1 Construction of the Lebesgue Integral Let (, F, µ) be a measure space (not necessarily a probability space). Our objective will be to define the Lebesgue integral R fdµ

More information

Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games

Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games Gabriel Y. Weintraub, Lanier Benkard, and Benjamin Van Roy Stanford University {gweintra,lanierb,bvr}@stanford.edu Abstract

More information

An Adaptive Clustering Method for Model-free Reinforcement Learning

An Adaptive Clustering Method for Model-free Reinforcement Learning An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at

More information

PROBABILITY: LIMIT THEOREMS II, SPRING HOMEWORK PROBLEMS

PROBABILITY: LIMIT THEOREMS II, SPRING HOMEWORK PROBLEMS PROBABILITY: LIMIT THEOREMS II, SPRING 218. HOMEWORK PROBLEMS PROF. YURI BAKHTIN Instructions. You are allowed to work on solutions in groups, but you are required to write up solutions on your own. Please

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Online Learning Schemes for Power Allocation in Energy Harvesting Communications

Online Learning Schemes for Power Allocation in Energy Harvesting Communications Online Learning Schemes for Power Allocation in Energy Harvesting Communications Pranav Sakulkar and Bhaskar Krishnamachari Ming Hsieh Department of Electrical Engineering Viterbi School of Engineering

More information

Lower PAC bound on Upper Confidence Bound-based Q-learning with examples

Lower PAC bound on Upper Confidence Bound-based Q-learning with examples Lower PAC bound on Upper Confidence Bound-based Q-learning with examples Jia-Shen Boon, Xiaomin Zhang University of Wisconsin-Madison {boon,xiaominz}@cs.wisc.edu Abstract Abstract Recently, there has been

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Model-Based Reinforcement Learning Model-based, PAC-MDP, sample complexity, exploration/exploitation, RMAX, E3, Bayes-optimal, Bayesian RL, model learning Vien Ngo MLR, University

More information

A Gentle Introduction to Reinforcement Learning

A Gentle Introduction to Reinforcement Learning A Gentle Introduction to Reinforcement Learning Alexander Jung 2018 1 Introduction and Motivation Consider the cleaning robot Rumba which has to clean the office room B329. In order to keep things simple,

More information

Reinforcement Learning as Classification Leveraging Modern Classifiers

Reinforcement Learning as Classification Leveraging Modern Classifiers Reinforcement Learning as Classification Leveraging Modern Classifiers Michail G. Lagoudakis and Ronald Parr Department of Computer Science Duke University Durham, NC 27708 Machine Learning Reductions

More information

Perturbed Proximal Gradient Algorithm

Perturbed Proximal Gradient Algorithm Perturbed Proximal Gradient Algorithm Gersende FORT LTCI, CNRS, Telecom ParisTech Université Paris-Saclay, 75013, Paris, France Large-scale inverse problems and optimization Applications to image processing

More information

Can we do statistical inference in a non-asymptotic way? 1

Can we do statistical inference in a non-asymptotic way? 1 Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.

More information

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18 CSE 417T: Introduction to Machine Learning Lecture 11: Review Henry Chai 10/02/18 Unknown Target Function!: # % Training data Formal Setup & = ( ), + ),, ( -, + - Learning Algorithm 2 Hypothesis Set H

More information

Weighted Sup-Norm Contractions in Dynamic Programming: A Review and Some New Applications

Weighted Sup-Norm Contractions in Dynamic Programming: A Review and Some New Applications May 2012 Report LIDS - 2884 Weighted Sup-Norm Contractions in Dynamic Programming: A Review and Some New Applications Dimitri P. Bertsekas Abstract We consider a class of generalized dynamic programming

More information

arxiv: v1 [cs.it] 21 Feb 2013

arxiv: v1 [cs.it] 21 Feb 2013 q-ary Compressive Sensing arxiv:30.568v [cs.it] Feb 03 Youssef Mroueh,, Lorenzo Rosasco, CBCL, CSAIL, Massachusetts Institute of Technology LCSL, Istituto Italiano di Tecnologia and IIT@MIT lab, Istituto

More information

A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time

A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time April 16, 2016 Abstract In this exposition we study the E 3 algorithm proposed by Kearns and Singh for reinforcement

More information

Generalization Bounds in Machine Learning. Presented by: Afshin Rostamizadeh

Generalization Bounds in Machine Learning. Presented by: Afshin Rostamizadeh Generalization Bounds in Machine Learning Presented by: Afshin Rostamizadeh Outline Introduction to generalization bounds. Examples: VC-bounds Covering Number bounds Rademacher bounds Stability bounds

More information

Simplex Algorithm for Countable-state Discounted Markov Decision Processes

Simplex Algorithm for Countable-state Discounted Markov Decision Processes Simplex Algorithm for Countable-state Discounted Markov Decision Processes Ilbin Lee Marina A. Epelman H. Edwin Romeijn Robert L. Smith November 16, 2014 Abstract We consider discounted Markov Decision

More information

Lecture 3: The Reinforcement Learning Problem

Lecture 3: The Reinforcement Learning Problem Lecture 3: The Reinforcement Learning Problem Objectives of this lecture: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information