An Empirical Dynamic Programming Algorithm for Continuous MDPs

Size: px

Start display at page:

Download "An Empirical Dynamic Programming Algorithm for Continuous MDPs"

Chrystal Price
5 years ago
Views:

1 An Empirical Dynamic Programming Algorithm for Continuous MDPs William B. Haskell, Rahul Jain, Hiteshi harma and Pengqian Yu arxiv: v math.oc ep 07 Abstract We propose universal randomized function approximation-based empirical value iteration EVI algorithms for Markov decision processes. The empirical nature comes from each iteration being done empirically from samples available from simulations of the next state. This makes the Bellman operator a random operator. A parametric and a non-parametric method for function approximation using a parametric function space and the Reproducing Kernal Hilbert pace RKH respectively are then combined with EVI. Both function spaces have the universal function approximation property. Basis functions are picked randomly. Convergence analysis is done using a random operator framework with techniques from the theory of stochastic dominance. Finite time sample complexity bounds are derived for both universal approximate dynamic programming algorithms. Numerical experiments support the versatility and effectiveness of this approach. Index Terms Continuous state space MDPs; Dynamic programming; Reinforcement Learning. I. INTRODUCTION There exist a wide variety of approximate dynamic programming DP, Chapter 6, and reinforcement learning RL algorithms 3 for finite state space Markov decision processes MDPs. But many real-world problems of interest have either a continuous state space, or large enough that it is best approximated as one. Approximate DP and RL algorithms do exist for continuous state space MDPs but choosing which one to employ is an art form: different techniques state space aggregation and function approximation 4 and algorithms work for different problems, and universally applicable algorithms are lacking. For example, fitted value iteration 5 is very effective for some problems but requires the choice of an appropriate basis functions for good approximation. No guidelines are available on how many basis functions to pick to ensure a certain quality of approximation. The strategy then employed is to use a large enough basis but has demanding computational complexity. In this paper, we propose approximate DP algorithms for continuous state space MDPs that are universal approximating function space can provide arbitrarily good approximation for any problem, computationally tractable, simple to implement and yet we have non-asymptotic sample complexity bounds. The first is accomplished by picking functions spaces for W.B. Haskell and P. Yu are with the Department of Industrial and ystems Engineering, National University of ingapore. Rahul Jain, and Hiteshi harma are with EE Department, University of outhern California. The second author s work was supported by an ONR Young Investigator Award #N Manuscript submitted: eptember 5, 07 approximation that are dense in the space of continuous functions. The second goal is achieved by relying on randomized selection of basis functions for approximation and also by empirical dynamic programming 6. The third is enabled because standard python routines can be used for function fitting and the fourth is by analysis in a random operator framework which provides non-asymptotic rate of convergence and sample complexity bounds. There is a large body of well-known literature on reinforcement learning and approximate dyamic programming for continuous state space MDPs. We discuss the most directly related. In 7, a sampling-based state space aggregation scheme combined with sample average approximation for the expectation in the Bellman operator is proposed. Under some regularity assumptions, the approximate value function can be computed at any state and an estimate of the expected error is given. But the algorithm seems to suffer from poor numerical performance. A linear programmingbased constraint-sampling approach was introduced in 8. Finite sample error guarantees, with respect to this constraintsampling distribution are provided but the method suffers from issues of feasibility. The closest paper to ours is 5 that does function fitting with a given basis and does empirical value iteration in each step. Unfortunately, it is not a universal method as approximation quality depends on the function basis picked. Other papers worth noting are 9 that discusses kernel-based value iteration and the bias-variance tradeoff, and 0 that proposed a kernel-based algorithm with random sampling of the state and action spaces, and proves asymptotic convergence. This paper is inspired by the random function approach that uses randomization to nearly solve otherwise intractable problems see e.g.,, and the empirical approach that reduces computational complexity of working with expectations 6. We propose two algorithms. For the first parametric approach, we pick a parametric function family. In each iteration a number of functions are picked randomly for function fitting by sampling the parameters. For the second non-parametric approach, we pick a RKH for approximation. Both function spaces are dense in the space of continuous functions. In each iteration, we sample a few states from the state space. Empirical value iteration EVI is then performed on these states. Each step of EVI involves approximating the Bellman operator with an empirical random Bellman operator by plugging a sample average approximation from simulation for the expectation. This is akin to doing stochastic approximations with step size. But an elegant random operator framework for convergence analysis using stochastic

2 dominance was introduced recently in 6 and comes in handy for us as well. The main contribution of this paper is development of randomized function approximation-based offline dynamic programming algorithms that are universally applicable i.e., do not require appropriate choice of basis functions for good approximation. A secondary contribution is further development of the random operator framework for convergence analysis in the L p norm that also yields finite time sample complexity bounds. The paper is organized as follows. ection II presents preliminaries including the continuous state space MDP model and the empirical dynamic programming framework for finite state MDPs introduced in 6. ection III presents two empirical value iteration algorithms - first, a randomized parametric function fitting method, and second, a non-parametric randomized function fitting in an RKH space. We also provide statements of main theorems about non-asymptotic error guarantees. ection IV presents a unified analysis of the two algorithms in a random operator framework. Numerical results are reported in ection V. econdary proofs are relegated to the appendix. II. PRELIMINARIE Consider a discrete time discounted MDP given by the 5-tuple,, A, Q, c, γ. The state space is a compact subset of R d with the Euclidean norm, with corresponding Borel σ algebra B. Let F be the space of all B measurable bounded functions f : R in the supremum norm f := sup s f s. Moreover, let M be the space of all probability distributions over and define the L norm as f, µ := f s µ ds for given µ M. We assume that the action space A is finite. The transition law Q governs the system evolution. For B B, Q B s, a is the probability of next visiting the set B given that action a A is chosen in state s. The cost function c : A R is a measurable function that depends on stateaction pairs. Finally, γ 0, is the discount factor. We will denote by Π the class of stationary deterministic Markov policies: mappings π : A which only depend on history through the current state. For a given state s, π s A is the action chosen in state s under the policy π. The state and action at time t are denoted s t and a t, respectively. Any policy π Π and initial state s determine a probability measure Ps π and a stochastic process s t, a t, t 0 defined on the canonical measurable space of trajectories of state-action pairs. The expectation operator with respect to Ps π is denoted E π s. We will assume that the cost function c satisfies c s, a c max < for all s, a A. Under this assumption, v π v max := c max / γ where v π is the value function for policy π Π defined as v π s = E π s t=0 γt c s t, a t, s. For later use, we define F ; v max to be the space of all functions f F such that f v max. The optimal value function is v s := inf π Π E π s t=0 γt c s t, a t, s. To characterize the optimal value function, we define the Bellman operator T : F F via T v s := min c s, a + γ EX Q s, a v X, s. It is known that the optimal value function v is a fixed point of T, i.e. T v = v. Classical value iteration is based on iterating T to obtain a fixed point, it produces a sequence v k k 0 F given by v k+ = T v k, k 0. Also, we know that v k k 0 converges to v geometrically in. We are interested in approximating the optimal value function v within a tractable class of approximating functions F F. We have the following definitions which we use to measure the approximation power of F with respect to T. We define d p, µ T f, F := inf f F f T f p, µ to be the approximation error for a specific f; and d p, µ T F, F := sup d p, µ T f, F f F to be the inherent L p Bellman error for the function class F. imilarly, the inherent L Bellman error for an approximating class F is d T F, F sup inf f T g. f F g F We often compare F to the Lipschitz continuous functions Lip L defined as f F : f s f s L s s, s, s. In our case, we say that an approximation class F is universal if d Lip L, F = 0 for all L 0. One of the difficulties of dynamic programming algorithms like value iteration above is that each iteration of the Bellman operator involves computation of an expectation which may be expensive. Thus, in 3, we proposed replacing the Bellman operator with an empirical or random Bellman operator, ˆTn v s := min c s, a + γ n n vx i, i= where X i are samples of the next state from Q s, a which can be obtained from simulation. Now, we can iterate the empirical Bellman operator, v k+ = ˆT n v k an algorithm we called Empirical Value Iteration EVI. The sequence of iterates v k is a random process. ince T is a contractive operator, its iterates converge to its fixed point v. The random operator ˆT n may be expected to inherit the contractive property in a probabilistic sense and its iterates converge to some sort of a probabilitic fixed point. We introduce ɛ, versions of two such notions introduced in 6. Definition. A function ˆv : R is an ɛ, -strong probabilistic fixed point for a sequence of random operators ˆT n if there exists an N such that for all n > N, P ˆT nˆv ˆv > ɛ <.

3 3 It is called a strong probabilistic fixed point, if the above is true for every positive ɛ and. Definition. A function ˆv : R is an ɛ, -weak probabilistic fixed point for a sequence of random operators ˆT n if there exist N and K such that for all n > N and all k > K, P ˆT n k v 0 ˆv > ɛ <, v 0 F. It is called a weak probabilistic fixed point, if the above is true for every positive ɛ and. Note that the stochastic iterative algorithms such as EVI often find the weak probabilistic fixed point of ˆT n whereas what we are looking for is v, the fixed point of T. In 3, it was shown that asymptotically the weak probabilistic fixed point of ˆT n coincides with its strong probabilistic fixed points which coincide with the fixed point of T under certain fairly weak assumptions and a natural relationship between T and ˆT n lim P ˆT n v T v > ɛ = 0, v F. n This implies that stochastic iterative algorithms such as EVI will find approximate fixed points of T with high probability. III. THE ALGORITHM AND MAIN REULT When the state space is very large, or even uncountable, exact dynamic programming methods are not practical, or even feasible. Instead, one must use a variety of approximation methods. In particular, function approximation or fitting the value function with a fixed function basis is a common technique. The idea is to sample a finite set of states from, approximate the Bellman update at these states, and then extend to the rest of through function fitting similar to 5. Furthermore, the expectation in the Bellman operator, for example, is also approximated by taking a number of samples of the next state. There are two main difficulties with this approach: First, the function fitting depends on the function basis chosen, making the results problem-dependent. econd, with a large basis for good approximation, function fitting can be computationally expensive. In this paper, we aim to address these issues by first picking universal approximating function spaces, and then using randomization to pick a smaller basis and thus reduce computational burden of the function fitting step. We consider two functional families, one is a parametric family FΘ parameterized over parameter space Θ and the other is a nonparametric regularized RKH. By µ M, we will denote a probability distribution from which to sample states in, and by a F F ; v max we will denote a functional family in which to do value function approximation. Let us denote by v k k 0 F ; v max, the iterates of the value functions produced by an algorithm and a sample of size N from is denoted s :N = s,..., s N. The empirical p norm of f is defined as f p p, ˆµ := N N n= f s n p for p, and as f, ˆµ := sup n=,..., N f s n for p =, where ˆµ is the empirical measure coresponding to the samples s :N. We will make the following technical assumptions for the rest of the paper similar to those made in 5. Assumption. i For all s, a A, Q s, a is absolutely continuous with respect to µ and C µ := sup dq s, a /dµ <. s, a A ii Given any sequence of policies π m m, the future state distribution ρ Q π Q πm is absolutely continuous with respect to µ, c ρ, µ m := sup d ρ Q π Q πm /dµ <, π,..., π m and C ρ, µ := m 0 γm c ρ, µ m <. The above assumptions are conditions on transition probabilities, the first being a sufficient condition for the second. A. Random Parametric Basis Function RPBF Approximation We introduce an empirical value iteration algorithm with function approximation using random parametrized basis functions EVI+RPBF. It requires a parametric family F built from a set of parameters Θ with probability distribution ν and a feature function φ : Θ R that depends on both states and parameters with the assumption that sup s, θ Θ φ s; θ. This can easily be met in practice by scaling φ whenever and Θ are both compact and φ is continuous in s, θ. Let α : Θ R be a weight function and define F Θ := f = φ ; θ α θ dθ : α θ C ν θ, θ Θ. Θ We note that the condition α θ C ν θ for all θ Θ is equivalent to requiring that α, ν := sup θ Θ α θ /ν θ C where α, ν is the ν weighted supremum norm of α and C is a constant. The function space F Θ may be chosen to have the universal function approximation property in the sense that any Lipschitz continuous function can be approximated arbitrarily closely in this space as shown in. By, Theorem, many such choices of F Θ are possible and are developed in, ection 5. For example, F Θ is universal in the following two cases: φ s; θ = cos ω, s + b where θ = ω, b R d+ ; and ν θ is given by ω Normal 0, γ I and b Uniform π, π; φ s; θ = sign s k t where θ = t, k R,..., d; and ν θ to be given by k Uniform,..., d and t Uniform a, a. In this approach, we have a parametric function family F Θ but instead of optimizing over parameters in Θ, we randomly sample them first and then do function fitting which involves optimizing over finite weighted combinations J j= α jφ ; θ j. Unfortunately, this leads to a non-convex optimization problem. Hence, instead of optimizing over θ :J = θ,..., θ J and α :J = α,..., α J jointly, we first do randomization over θ :J and then optimization over α :J, as in, to bypass the non-convexity inherent in optimizing over θ :J and α :J simultaneously. This approach allows us to deploy rich parametric families without much additional computational cost. Once we draw a

4 4 random sample θ j J j= from Θ according to ν, we obtain a random function space: F θ :J := J f = α j φ ; θ j : α,..., α J C/J. j= tep of such an algorithm Algorithm involves sampling states s :N over which to do value iteration and sampling parameters θ :J to pick basis functions φ ; θ which are used to do function fitting. tep involves doing an empirical value iteration over states s :N by sampling next states Xm sn, a M m= according to the transition kernel Q, and using the current iterate of the value function v k. Note that fresh i.i.d. samples of the next state are regenerated in each iteration. tep 3 involves finding the best fit to ṽ k, the iterate from tep, within F θ :J wherein randomly sampled parameters θ :J specify the basis functions for function fitting and weights α :J are optimized, which is a convex optimization problem. Algorithm EVI with random parameterized basis functions EVI+RPBF Input: probability distribution µ on and ν on Θ; ample sizes N, M, J ; initial seed v 0. For k =,..., K ample s n N n= µ N and θ j J j= νj. Compute ṽ k s n = min c s n, a + γ M M m= v k X sn, a m where Xm sn, a Q s n, a, m =,, M are i.i.d. 3 α k N J = arg min α N n= j= α jφ s n ; θ j ṽ s n s.t. α,..., α J C/J. v k+ s = J j= αk j φs; θ j. 4 Increment k k + and return to tep. We note that tep 3 of the algorithm can be replaced by another method for function fitting as we do in the next subsection. The above algorithm differs from Fitted Value Iteration FVI algorithm of 5 in how it does function fitting. FVI does function fitting with a deterministic and given set of basis functions which limits its universality while we do function fitting in a much larger space which has a universal function approximation property but are able to reduce computational complexity by exploiting randomization. In 5, ection 7, it is shown that if the transition kernel and cost are smooth, in the sense that and Q s, a Q s, a T V L Q s s c s, a c s, a L c s s hold for all s, s and a A, then the Bellman operator T maps bounded functions to Lipschitz continuous functions. In particular, if v is uniformly bounded by v max then T v is L c + γ v max L Q Lipschitz continuous. ubsequently, the inherent L Bellman error satisfies d T F, F, d Lip L, F since T F Lip L. o, it only remains to choose an F Θ that is dense in Lip L in the supremum norm, for which many examples exist. We now provide non-asymptotic sample complexity bounds to establish that Algorithm yields an approximately optimal value function with high probability. We provide guarantees for both the L and L metrics on the error. Denote N, = 7 5 v 40 e J + max log 0 e v max J, M, = J, = K = v max log 0 N A µ p; K = p p K,, 5C + log 5, ln Cρ, µ ln v max, and ln γ where C is the same constant that appears in the definition of F Θ see and v max = v max /. et := / /K. Theorem. Given an > 0, and a 0,, choose J J,, N N,, M M,, Then, for K log 4/ µ ; K, we have v K v, ρ C ρ, µ d, µ T F Θ, F Θ + with probability at least. Remarks. That is, if we choose enough samples N of the states, enough samples M of the next state, and enough random samples J of the parameter θ, and then for large enough number of iterations K, the L error in the value function is determined by the inherent Bellman error of the function class F Θ. For the function families FΘ discussed earlier, the inherent Bellman error is indeed zero, and so the value function will have small L error with high probability. Next we give a similar guarantee for L error for Algorithm by considering approximation in L, µ. We note that L, µ is a Hilbert space and that many powerful function approximation results exist for this setting because of the favorable properties of a Hilbert space. Denote N, = 7 5 v 40 e J + max 4 log 0 e v J max, M, = v max log 0 N A, J, = 5C + log 5, and ln C K ρ, / µ ln v max = ln γ, where again v max = v max /. et := / /K.

5 5 Theorem. Given an > 0, and a 0,, choose J J,, N N,, M M,, Then, for K log 4/ µ ; K, we have v K v, ρ γ / C / ρ, µ d, µ T F Θ, F Θ + with probability at least. Remarks. Again, note that the above result implies that if we choose a function families FΘ with inherent Bellman error zero, enough samples N of the states, enough samples M of the next state, and enough random samples J of the parameter θ, and then for large enough number of iterations K, the value function will have small L error with high probability. B. Non-parametric Function Approximation in RKH We now consider non-parametric function approximation combined with EVI. We employ a Reproducing Kernel Hilbert pace RKH for function approximation since for suitably chosen kernels, it is dense in the space of continuous functions and hence has a universal function approximation property. In the RKH setting, we can obtain guarantees directly with respect to the supremum norm. We will consider a regularized RKH setting with a continuous, symmetric and positive semidefinite kernel K : R and a regularization constant λ > 0. The RKH space, H K is defined to be the closure of the linear span of Ks, s endowed with an inner product, HK. The inner product, HK for H K is defined such that K x,, K y, HK = K x, y for all x, y, i.e., i α ik x i,, j β jk y j, HK = i, j α iβ j K x i, y j. ubsequently, the inner product satisfies the reproducing property: K s,, f HK = f s for all s and f H K. The corresponding RKH norm is defined in terms of the inner product f HK := f, f HK. We assume that our kernel K is bounded so that κ := sup s K s, s <. To find the best fit f H K to a function with data s n, ṽ s n N n=, we solve the regularized least squares problem: min f H K N N f s n ṽ s n + λ f H K. 3 n= This is a convex optimization problem the norm squared is convex, and has a closed form solution by the Representer Theorem. In particular, the optimal solution is of the form ˆf s = N n= α nk s n, s where the weights α :N = α,..., α N are the solution to the linear system K s i, s j Ni, j= + λ N I α n N n= = ṽ s n N n=. 4 This yields EVI algorithm with randomized function fitting in a regularized RKH EVI+RKH displayed as Algorithm. Note that the optimization problem in tep 3 in Algorithm is analogous to the optimization problem in tep 3 of Algorithm which finds an approximate best fit within the finite-dimensional space F θ :J, rather than the entire space F Θ, while Problem 3 in Algorithm optimizes over the entire space H K. This difference can be reconciled by the Representer Theorem, since it states that optimization over H K in Problem 3 is equivalent to optimization over the finitedimensional space spanned by K s n, : n =,..., N. Note that the regularization λ f H K is a requirement of the Representer Theorem. Algorithm EVI with regularized RKH EVI+RKH Input: probability distribution µ on ; sample sizes N, M ; penalty λ; initial seed v 0 ; counter k = 0. For k =,..., K ample s n N n= µ. Compute ṽ k s n = min arg min f H K N c s n, a + γ M M m= v k X sn, a m where X sn, a M j Q s m= n, a are i.i.d. 3 v k+ is given by N fs n ṽs n + λ f HK. n= 4 Increment k k + and return to tep. We define the regression function f M : R via f M s E min c s, a + γ M v Xm s, a, s, M m= it is the expected value of our empirical estimator of T v. As expected, f M T v as M. We note that f P is not necessarily equal to T v by Jensen s inequality. We require the following assumption on f M to continue. Assumption. For every M, f M s = K s, y α y µ dy for some α L, µ. Regression functions play a key role in statistical learning theory, Assumption states that the regression function lies in the span of the kernel K. Additionally, when H K is dense in the Lipschitz functions, then the inherent Bellman error is zero. For example, K s, s = exp γ s s, K s, s = a s s, and K s, s = exp γ s s are all universal kernels. Denote N, = M = K = 6 4 CK κ 4 log γ 60 vmax A γ 8 γ log vmax γ γ γ ln ln 4 vmax ln γ where C K is a constant independent of the dimension of see 4 for the details on how C K depends on K and set = / /K.,

6 6 Theorem 3. uppose Assumption holds. Given any > 0 and 0,, choose an N N, and an M M. Then, for any K log 4/ µ ; K, v K v with probability at least. Note that we provide guarantees on L and L error can be generalized to L p with the RPBF method and for L error with the RKH-based randomized function fitting method. Getting guarantees for the L p error with the RKH method has proved quite difficult, as has bounds on the L error with the RBPF method. IV. ANALYI IN A RANDOM OPERATOR FRAMEWORK We will analyze Algorithms and in terms of random operators since this framework is general enough to encompass many such algorithms. The reader can see that tep of both algorithms involves iteration of the empirical Bellman operator while tep 3 involves a randomized function fitting step which is done differently and in different spaces in both algorithms. We use random operator notation to write these algorithms in a compact way, and then derive a clean and to a largeextent unified convergence analysis. The key idea is to use the notion of stochastic dominance to bound the error process with an easy to analyze dominating Markov chain. Then, we can infer the solution quality of our algorithms via the probability distribution of the dominating Markov chain. This analysis idea refines and in fact, simplifies the idea we introduced in 3 for MDPs with finite state and action spaces where there is no function fitting in the supremum norm. In this paper, we develop the technique further, give a stronger convergence rate, account for randomized function approximation, and also generalize the technique to L p norms. We introduce a probability space Ω, B Ω, P on which to define random operators, where Ω is a sample space with elements denoted ω Ω, B Ω is the Borel σ algebra on Ω, and P is a probability distribution on Ω, B Ω. A random operator is an operator-valued random variable on Ω, B Ω, P. We define the first random operator on F as T v = s n, ṽ s n N n= where s n N n= is chosen from according to a distribution µ M and ṽ s n = min c s n, a + γ M M m= v X sn, a m n =,..., N is an approximation of T v s n for all n =,..., N. In other words, T maps from v F ; vmax to a randomly generated sample of N input-output pairs s n, ṽ s n N n= of the function T v. Note that T depends on sample sizes N and M. Next, we have the function reconstruction operator Π F which maps the data s n, ṽ s n N n= to an element in F. Note that Π F is not necessarily deterministic since Algorithms and use randomized function fitting. We can now write both algorithms succinctly as v k+ = Ĝ v k := Π F T vk, 5, which can be further written in terms of residual error k = Ĝ v k T v k as v k+ = Ĝ v k = T v k + k. 6 Iteration of these operators corresponds to repeated samples from Ω, BΩ, P, so we define the space of sequences Ω, BΩ, P where Ω = Ω with elements denoted ω = ω k k 0, B Ω = B Ω, and P is the probability measure on Ω, B Ω guaranteed by the Kolmogorov extension theorem applied to P. The random sequences v k k 0 in Algorithms and given by v k+ = Π F T ωk v k = Π F T ωk Π F T ωk Π F T ω0 v 0, for all k 0 is a stochastic process defined on Ω, BΩ, P. We now analyze error propagation over the iterations. Let us now investigate how the Bellman residual at each iteration of EVI affects the quality of the resulting policy. There have already been some results which address the error propagation both in L and L p p norms 5. After adapting 5, Lemma 3, we obtain the following point-wise error bounds on v K v in terms of the errors k k 0. Lemma 4. For any K, and > 0, suppose k p, µ for all k = 0,,..., K, then γ v K v K+ p p p, ρ Cρ, /p γ µ + γ K/p v max. 7 where C ρ, µ is the discounted-average concentrability coefficient of the future-state distributions as defined in 5, Assumption A. Note that Lemma 4 assumes that k, µ which we will show in the subsequent section that it is true with high probability. The second inequality is for the supremum norm. Lemma 5. For any K and > 0, suppose k for all k = 0,,..., K, then v K v / γ + γ K v max. 8 Inequalities 7 and 8 are the key to analyzing iteration of Equation 6. A. Convergence analysis using stochastic dominance We now provide a unified convergence analysis for iteration of a sequence of random operators given by 5 and 6. Later, we will show how it can be applied to Algorithms and. We will use to denote a general norm in the following discussion, since our idea applies to all instances of p, and p = simultaneously. The magnitude of the error in iteration k 0 is then k. We make the following key assumption for a general EVI algorithm. Assumption 3. For > 0, there is a q 0, such that Pr k q for all k 0.

7 7 Assumption 3 states that we can find a lower bound on the probability of the event k that is independent of k and v k k 0 but does depend on. Equivalently, we are giving a lower bound on the probability of the event T v k Ĝ v k. This is possible for all of the algorithms that we proposed earlier. In particular, we can control q in Assumption 3 through the sample sizes in each iteration of EVI. Naturally, for a given, q increases as the number of samples grows. We first choose > 0 and the number of iterations K for our EVI algorithms to reach a desired accuracy this choice of K comes from the inequalities 7 and 8. We call iteration k good if the error k is within our desired tolerance and bad when the error is greater than our desired tolerance. We then construct a stochastic process X k k 0 on Ω, B Ω, P with state space K :=,,..., K such that max X k,, if iteration k is "good", X k+ = K, otherwise. The stochastic process X k k 0 is easier to analyze than v k k 0 because it is defined on a finite state space, however X k k 0 is not necessarily a Markov chain. We next construct a dominating Markov chain Y k k 0 to help us analyze the behavior of X k k 0. We construct Y k k 0 on K, B, the canonical measurable space of trajectories on K, so Y k : K R, and we let Q denote the probability measure of Y k k 0 on R, B. ince Y k k 0 will be a Markov chain by construction, the probability measure Q is completely determined by an initial distribution on R and a transition kernel for Y k k 0. We always initialize Y 0 = K, and then construct the transition kernel as follows max Y k,, w.p. q, Y k+ = K, w.p. q, where q is the probability of a good iteration with respect to the corresponding norm. Note that the Y k k 0 we introduce here is different and has much smaller state space than the one we introduced in 6 leading to stronger convergence guarantees. We now describe a stochastic dominance relationship between the two stochastic processes X k k 0 and Y k k 0. We will establish that Y k k 0 is larger than X k k 0 in a stochastic sense. Definition 3. Let X and Y be two real-valued random variables, then X is stochastically dominated by Y, written X st Y, when E f X E f Y for all increasing functions f : R R. Equivalently, X st Y when Pr X θ Pr Y θ for all θ in the support of Y. Let F k k 0 be the filtration on Ω, B Ω, P corresponding to the evolution of information about X k k 0, and let X k+ F k denote the conditional distribution of X k+ given the information F k. We have the following initial results on the relationship between X k k 0 and Y k k 0. The following theorem, our main result for our random operator analysis, compares the marginal distributions of X k k 0 and Y k k 0 at all times k 0 when the two stochastic processes X k k 0 and Y k k 0 start from the same state. Theorem 6. i X k st Y k for all k 0. ii Pr Y k η Pr X k η for any η R and all k 0. Proof. First we note that, by 6, Lemma A., Y k+ Y k = η is stochastically increasing in η for all k 0, i.e. Y k+ Y k = η st Y k+ Y k = η for all η η. Then, by 3, Lemma A., X k+ X k = η, F k st Y k+ Y k = η for all η f and F k for all k 0. i Trivially, X 0 st Y 0 since X 0 as Y 0. Next, we see that X st Y by 3, Lemma A.. We prove the general case by induction. uppose X k st Y k for k, and for this proof define the random variable max θ,, w.p. q, Y θ = K, w.p. q. to be the conditional distribution of Y k conditional on θ, as a function of θ. We see that Y k+ has the same distribution as Y θ θ = Y k by definition. ince Y θ are stochastically increasing by Lemma 3, Lemma A., we see that Y θ θ = Y k st Y θ θ = X k by 6, Theorem.A.6 and our induction hypothesis. Now, Y θ θ = X k st X k+ X k, F k by 6, Theorem.A.3d and Lemma 3, Lemma A. for all histories F k. It follows that Y k+ X k+ by transitivity of st. ii Follows from part i by the definition of st. st By Theorem 6, if X K st Y K and we can make Pr Y K η large, then we will also obtain a meaningful bound on Pr X K η. Following this observation, the next two corollaries are the main mechanisms for our general sample complexity results for EVI. Corollary 7. For a given p,, and any > 0, and 0,, suppose Assumption 3 holds for this, and choose any K. Then for q / + / /K and K log 4/ / / q q K, we have γ v K v K + p p p, ρ Cρ, /p γ µ + γ K /p v max with probability at least. Proof. ince Y k k 0 is an irreducible Markov chain on a finite state space, its steady state distribution µ = µ i K i= on K exists. By 3, Lemma 4.3, the steady state distribution of Y k k 0 is µ = µ i K i= given by: µ = q K µ i = q q K i, i =,..., K, µ K = q. The constant µ min q; K := min q K, q q K, q, for all q 0, and K, which is the minimum of the steady state probabilities appears shortly in the Markov chain

8 8 mixing time bound for Y k k 0. We note that µ q; K = q q K µ min q; K is a simple lower bound for µ min q; K we defined µ q; K = q q K earlier. Now, recall that µ ν T V = K η= µ η ν η is the total variation distance for probability distributions on K. Let Q k be the marginal distribution of Y k for k 0. By a Markov chain mixing time argument, e.g., 7, Theorem.3, we have that t mix := min k 0 : Q k µ T V log µ min q; K log q q K for any 0,. o, for K log / q q K we have Pr Y K = µ = Pr Y K = q K Q K µ T V, where we use µ = q K. By Theorem 6, Pr X K = Pr Y K = and so Pr X K = q K. Choose q and to satisfy q K = / + / and = q K = / / to get q K, and the desired result follows. The next result uses the same reasoning for the supremum norm case. Corollary 8. Given any > 0 and 0,, suppose Assumption 3 holds for this, and choose any K. For q / + / /K and K log 4/ / / q q K, we have Pr v K v / γ + γ K v max. The sample complexity results for both EVI algorithms from ection III follow from Corollaries 7 and 8. This is shown next. B. Proofs of Theorems,, and 3 We now apply our random operator framework to both EVI algorithms. We will see that it is easy to check the conditions of Corollaries 7 and 8, from which we obtain specific sample complexity results. We can now give the proof of Theorem by using our Markov chain technique. Based on Theorem 6 in the appendix, we let p N, M, J, denote the lower bound on the probability of the event T v T v, µ for > 0. We also note that d, µ T v, F Θ d, µ T F Θ, F Θ for all v F Θ. Proof. of Theorem tarting with inequality 7 for p = and using the statement of Theorem 6, we have v K v, ρ C ρ, µ d, µ T F Θ, F Θ v max γ K when k, µ d, µ T F Θ, F Θ + for all k = 0,,..., K. Choose K such that ln 4 v max γ K C ρ, µ K Cρ, µ ln v max =. ln γ Based on Corollary 7, we just need to choose N, M, J such that p N, M, J, / /K. We then apply the statement of Theorem 6 with probability / /K. We now give the proof of Theorem along similar lines based on Theorem 7. Again, we let p N, M, J, denote the a lower bound on the probability of the event T v T v, µ. Proof. of Theorem tarting with inequality 7 for p = and using the statement of Theorem 7, we have v K v, ρ / Cρ, / µ d, µ T F Θ, F Θ + γ / +4 v maxγ K/, γ when k, µ d, µ T F Θ, F Θ + for all k = 0,,..., K. We choose K to satisfy / / 4 v maxγ K / Cρ, / γ γ µ which implies K = lnc / ρ, µ ln vmax ln γ. Based on Corollary 7, we just need to choose N, M, J such that p N, M, J, / /K. We then apply the statement of Theorem 6 with p = / /K. We now provide proof of L function fitting in RKH based on Theorem 9 in the appendix. For this proof, we let p N, M, denote a lower bound on the probability of the event T v T v. Proof. of Theorem 3 By inequality 8, we choose and K such that / γ ɛ/ and γ K v max ɛ/ by setting ln ɛ ln 4 K vmax. ln γ Based on Corollary 7, we next choose N and M such that p N, M, / /K. We then apply the statement of Theorem 9 with error ɛ γ / and probability / /K. V. NUMERICAL EXPERIMENT We now present numerical performance of our algorithm by testing it on the benchmark optimal replacement problem 7, 5. The setting is that a product such as a car becomes more costly to maintain with time/miles, and must be replaced it some point. Here, the state s t R + represents the accumulated utilization of the product. Thus, s t = 0 denotes a brand new durable good. Here, A = 0,, so at each time step, t, we can either replace the product a t = 0 or keep it a t =. Replacement incurs a cost C while keeping the product has

9 9 Fig.. Relative Error with iterations for various algorithms. a maintenance cost, cs t, associated with it. The transition probabilities are as follows: λe λst+ st, if s t+ s t and a t = qs t+ s t, a t = λe λst+, if s t+ 0 and a t = 0 0, otherwise and the reward function is given by cs t, if a t = rs t, a t = C c0, if a t = 0 For our computation, we use γ = 0.6, λ = 0.5, C = 30 and cs = 4s. The optimal value function and the optimal policy can be computed analytically for this problem. For EVI+RPBF, we use J random parameterized Fourier functions φs, θ j = cosθ T j s + bj j= with θ j N 0, 0.0 and b Unif π, π. We fix J=5. For EVI+RKH, we use Gaussian kernel defined as kx, y = exp x y /σ with /σ = 0.0 and L regularization. We fix the regularization coefficient to be 0. The underlying function space for FVI is polynomials of degree 4. The results are plotted after 0 iterations. The error in each iteration for different algorithms with N = 00 states and M = 5 is shown in Figure. On Y- axis, it shows the relative error computed as sup s v s v k s/v s with iterations k on the X-axis. It shows that EVI+RPBF has relative error below 0% after 0 iterations. FVI is close but EVI+RKH is slower though it may improve with a higher M. This is also refelected in the actual runtime performance: EVI+RPBF takes 8,705s, FVI 8,654s and EVI+RKH takes 4,73s to get within 0. relative error. Note that performance of FVI depends on being able to choose suitable basis functions which for the optimal Goal EVI+RPBF FVI EVI+RKH m 4.8m 8.7m m 3.7m 3.m m 4.5m 54.3m TABLE I RUNTIME PERFORMANCE OF VARIOU ALGORITHM ON THE CART-POLE PROBLEM. replacement problem is easy. For other problems, we may expect both EVI algorrithms to perform better. o, we tested the algorithms on the cart-pole balancing problem, another benchmark problem but for which the optimal value function is unknown. We formulate it as a continuous -dimensional state space with action MDP. The state comprises of the position of the cart and angle of the pole. The actions are to add a force of 0N or +0N to the cart, pushing it left or right. We add ±50% noise to these actions. ystem dynamics for this system are given in 3. Rewards are zero except for failure state if the position of cart reaches beyond ±.4, or the pole exceeds an angle of ± degrees, it is. For our experiments, we choose N = 00 and M =. In case of RPBF, we consider parameterized Fourier basis of the form cosw T s + b where w = w, w, w, w N 0, and b Unif π, π. We fix J = 0 for our EVI+RPBF. For RKH, we consider Gaussian kernel, Ks, s = exp σ s s / with σ = 0.0. We limit each episode to 000 time steps. We compute the average length of the episode for which we are able to balance the pole without hitting the failure state. This is the goal in Table I. The other columns show run-time needed for the algorithms to learn to achieve such a goal. From the table, we can see that EVI+RPBF outperforms FVI and EVI+RKH. Note that guarantees for FVI are only available for L -error and for EVI-RPBF for L p -error. EVI- RKH is the only algorithm that can provide guarantees on the sup-norm error. Also note that when for problems for which the value functions are not so regular, and good basis functions difficult to guess, the EVI+RKH method is likely to perform better but as of now we do not have a numerical example to demonstrate this. VI. CONCLUION In this paper, we have introduced universally applicable approximate dynamic programming algorithms for continuous state space MDPs. The algorithms introduced are based on using randomization to break the curse of dimensionality via the synthesis of the random function approximation and empirical approaches. Our first algorithm is based on a random parametric function fitting by sampling parameters in each iteration. The second is based on sampling states which then yield a set of basis functions in an RKH from the kernel. Both function fitting steps involve convex optimization problems and can be implemented with standard packages. Both algorithms can be viewed as iteration of a type of random Bellman operator followed by a random projection operator. This situation can be quite difficut to analyze. But viewed as a random operator, following 6, we can construct Markov chains that stochastically dominates the error sequences which are quite easy to analyze. They yield convergence but also nonasymptotic sample complexity bounds. Numerical experiments on the cart-pole balancing problem suggests good performance in practice. More rigorous numerical analysis will be conducted as part of future work.

10 0 REFERENCE D. P. Bertsekas, Dynamic programming and optimal control 3rd edition, volume ii, Belmont, MA: Athena cientific, 0. W. B. Powell, Approximate Dynamic Programming: olving the curses of dimensionality. John Wiley & ons, 007, vol R.. utton and A. G. Barto, Reinforcement learning: An introduction. Cambridge Univ Press, 998, vol., no.. 4 R. A. DeVore, Nonlinear approximation, Acta numerica, vol. 7, pp. 5 50, R. Munos and C. zepesvári, Finite-time bounds for fitted value iteration, The Journal of Machine Learning Research, vol. 9, pp , W. B. Haskell, R. Jain, and D. Kalathil, Empirical dynamic programming, Mathematics of Operations Research, vol. 4, no., pp , J. Rust, Using randomization to break the curse of dimensionality, Econometrica: Journal of the Econometric ociety, pp , D. P. De Farias and B. Van Roy, On constraint sampling in the linear programming approach to approximate dynamic programming, Mathematics of operations research, vol. 9, no. 3, pp , D. Ormoneit and Ś. en, Kernel-based reinforcement learning, Machine learning, vol. 49, no. -3, pp. 6 78, Grunewalder, G. Lever, L. Baldassarre, M. Pontil, and A. Gretton, Modelling transition dynamics in mdps with rkhs embeddings, arxiv preprint arxiv: , 0. A. Rahimi and B. Recht, Uniform approximation of functions with random bases, in Communication, Control, and Computing, th Annual Allerton Conference on. IEEE, 008, pp , Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning, in Advances in neural information processing systems, 009, pp W. B. Haskell, R. Jain, and D. Kalathil, Empirical dynamic programming, To appear in Mathematics of Operations Research, male and D.-X. Zhou, hannon sampling ii: Connections to learning theory, Applied and Computational Harmonic Analysis, vol. 9, no. 3, pp , R. Munos, Performance bounds in l_p-norm for approximate value iteration, IAM journal on control and optimization, vol. 46, no., pp , M. haked and J. G. hanthikumar, tochastic Orders. pringer, D. A. Levin, Y. Peres, and E. L. Wilmer, Markov Chains and Mixing Times. American Mathematical ociety, M. Anthony and P. L. Bartlett, Neural network learning: Theoretical foundations. cambridge university press, D. Haussler, phere packing numbers for subsets of the boolean n-cube with bounded vapnik-chervonenkis dimension, Journal of Combinatorial Theory, eries A, vol. 69, no., pp. 7 3, 995. A. upplement for ection III APPENDIX The following computation shows that T maps bounded functions to Lipschitz continuous functions when Q and c are both Lipschitz continuous in the sense of and. uppose v v max, then T v is Lipschitz continuous with constant L c + γ v max L Q. We have T v s T v s max c s, a c s, a + γ max v y Q dy s, a v y Q dy s, a L c s s + γ v max max Q dy s, a Q dy s, a L c + γ v max L Q s s. B. upplement for ection IV First, we need to adapt 5, Lemma 3 to obtain point-wise error bounds on v K v in terms of the errors k k 0. These bounds are especially useful when analyzing the performance of EVI with respect to other norms besides the supremum norm, since T does not have a contractive property with respect to any other norm. For any π Π, we define the operator Q π : F F which gives the transition mapping as a function of π via Q π v s v y Q dy s, π s, s. Then we define the operator T π : F F via T π v s c s, π s+γ v x Q dx s, π s, s. For later use, we let π Π be an optimal policy satisfying π s arg min c s, a + γ v x Q dx s, a, s,, i.e., it is greedy with respect to v. More generally, a policy π Π is greedy with respect to v F if T π v = T v. For use throughout this section, we let π k be a greedy policy with respect to v k so that T π k v k = T v k for all k 0. Then, for fixed K we define the operators A K Q π K + Q π K Q π K Q π0, A k Q π K k + Q π K Q π K Q k+ π, for k = 0,..., K, formed by composition of transition kernels. We let be the constant function equal to one on, and we define the constant γ = γk+ γ for use shortly. We note that A k K are all linear operators and A k = for all k = 0,..., K. Lemma 9. For any K, i v K v K K k γk k Q π k + γ K K Q π v0 v. ii v K v K γk k Q π K Q π K Q π k+ k + γ K Q π K Q π K Q π0 v 0 v. K iii v K v γk k A k k + γ K A K v max. Proof. i For any k, we have T v k T π v k and T π v k T π v = γ Q π v k v, so v k+ v = T v k + k T π v k + T π v k T π v γ Q π v k v + k. The result then follows by induction. ii imilarly, for any k, we have T v T π k v and T v k T π k v = T π k v k T π k v = γ Q π k v k v, so v k+ v = T v k + k T π k v + T π k v T v γ Q π k v k v + k. Again, the result follows by induction. iii If f g h in F, then g f + h, so combining parts i and ii gives K v K v γ K k A k k + γ K A K v 0 v.

11 Then we note that v 0 v v max. Now we use Lemma 9 to derive p norm bounds. K Proof. of Lemma 4 Using γk = γ K+ / γ, we define the constants γ γk k α k = γ K+, k = 0,..., K, γ γk α K = γ K+, and we note that K α k =. Then, we obtain v K v K γ α k A k k + α K A K v max from Lemma 9iii. Next, we compute v K v p p, ρ = v K s v s p ρ ds K p γ p α k A k k + α K A K v max s ρ ds γ p K α k A k k p + α K A K v max p s ρ ds, using Jensen s inequality and convexity of x x p. Now, we have ρ A k c ρ, µ K k µ for k = 0,..., K by Assumption ii and so for all k = 0,..., K, A k k p s ρ ds c ρ, µ K k k p p, µ. We arrive at v K v p p, ρ K γ p α k c ρ, µ K k k p p, µ + α K v max p K = p γ p γ K k c ρ, µ K k k p p, µ + γ K v which max p, requires Assumption ii. K v K v max γ K k Q π K k k +γ K Q π K v0 v, γ K k Q π K Q π K Q π K+ k K +γ K Q π K Q π K Q π0 v 0 v, by Lemma 9. Now, and γ K k Q π K k k K K K +γ K Q π K v0 v γ K k k + γ K v 0 v γ K k Q π K Q π K Q π K+ k +γ K Q π K Q π K Q π0 v 0 v K γ K k k + γ K v 0 v, where we use the triangle inequality, the fact that Q f s f y Q dy s f for any transition kernel Q on and f F, and v 0 v v max. For any K, v K v K γ K k k + γ K v max. 9 Follows immediately since K γk k / γ for all K. We emphasize that Lemma 5 does not require any assumptions on the transition probabilities, in contrast to Lemma 4 where we use v 0 v p v max p. Now, by subadditivity C. upplement for ection IV: Function approximation of x x t for t = /p 0, with p,, assumption We record several pertinent results here on the type of that k p, µ for all k = 0,,..., K, and since function reconstruction used in our EVI algorithms. The K γk k c ρ, µ K k C ρ, µ by Assumption first lemma is illustrative of approximation results in Hilbert ii, we see spaces, it gives an O / J convergence rate on the error γ v K v K+ p p from using F θ :J compared to F Θ in L, µ in probability. p, ρ Cρ, /p γ µ + γ K/p v max, which gives the desired result. Lemma 0., Lemma Fix f F Θ, for any 0, there exists a function ˆf F θ :J such that upremum norm error bounds follow more easily from Lemma 9. f ˆf, µ C + log J Proof. Proof of Lemma 5 We have with probability at least. The next result is an easy consequence of, Lemma and bounds the error from using F θ :J compared to F Θ in L, µ.

12 Lemma. Fix f F Θ, for any 0, there exists a function ˆf F θ :J such that ˆf f, µ C J + with probability at least. log Proof. Choose f, g F, then by Jensen s inequality we have f g, µ = E µ f g = E µ f g / E µ f g. The desired result then follows by, Lemma. Now we consider function approximation in the supremum norm. Recall the definition of the regression function f M s = E min c s, a + γ M v Xm s, a, M m= s. Then we have the following approximation result, for which we recall the constant κ := sup s K s, s. Corollary. 4, Corollary 5 For any 0,, f z, λ f M HK C log 4/ /6 log 4/ /3 κ for λ =, N N with probability at least. Proof. Uses the fact that for any f H K, f κ f HK. For any s, we have f s = K s,, f HK and subsequently K s,, f HK K s, HK f HK = K s,, K s, HK f HK = K s, s f HK sup K s, s f HK, s where the first inequality is by Cauchy-chwartz and the second is by assumption that K is a bounded kernel. The preceding result is about the error when approximating the regression function f M, but f M generally is not equal to T v. We bound the error between f M and T v as well in the next subsection. D. Bellman error The layout of this subsection is modeled after the arguments in 5, but with the added consideration of randomized function fitting. We use the following easy-to-establish fact. Fact 3. Let X be a given set, and f : X R and f : X R be two real-valued functions on X. Then, i inf x X f x inf x X f x sup x X f x f x, and ii sup x X f x sup x X f x sup x X f x f x. For example, Fact 3 can be used to show that T is contractive in the supremum norm. The next result is about T, it uses Hoeffding s inequality to bound the estimation error between ṽ s n N n= and T v s n N n= in probability. Lemma 4. For any p,, f, v F ; v max, and > 0, Pr f T v p, ˆµ f ṽ p, ˆµ > M N A exp. v max Proof. First we have f T v p, ˆµ f ṽ p, ˆµ T v ṽ p, ˆµ by the reverse triangle inequality. Then, for any s we have T v s ṽ s = max γ max c s, a + γ c s, a + γ M M m= v x Q dx s, a M v x Q dx s, a v X s, a m M v Xm s, a m= by Fact 3. We may also take v s 0, v max for all s by assumption on the cost function, so by the Hoeffding inequality and the union bound we obtain Pr max T v s n ṽ s n n=,..., N M N A exp and thus v max Pr T v ṽ p, ˆµ N = Pr N n= T v s n ṽ s n p /p Pr max n=,..., N T v s n ṽ s n, which gives the desired result. To continue, we introduce the following additional notation corresponding to a set of functions F F : F s :N f s,..., f s N : f F; N, F s :N is the covering number of F s :N with respect to the norm on R N. The next lemma uniformly bounds the estimation error between the true expectation and the empirical expectation over the set F θ :J in the following statement, e is Euler s number. Lemma 5. For any > 0 and N, N Pr sup f n E µ f N > f Fθ :J n= J e vmax N 8 e J + exp 8 vmax. Proof. For any F F ; v max, > 0, and N, we have

An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs

2015 IEEE 54th Annual Conference on Decision and Control CDC December 15-18, 2015. Osaka, Japan An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs Abhishek Gupta Rahul Jain Peter