A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes

Michael Kearns, AT&T Labs
Yishay Mansour, AT&T Labs and Tel-Aviv University
Andrew Y. Ng, UC Berkeley

Abstract

An issue that is critical for the application of Markov decision processes (MDPs) to realistic problems is how the complexity of planning scales with the size of the MDP. In stochastic environments with very large or even infinite state spaces, traditional planning and reinforcement learning algorithms are often inapplicable, since their running time typically scales linearly with the state space size in the worst case. In this paper we present a new algorithm that, given only a generative model (simulator) for an arbitrary MDP, performs near-optimal planning with a running time that has no dependence on the number of states. Although the running time is exponential in the horizon time (which depends only on the discount factor and the desired degree of approximation to the optimal policy), our results establish for the first time that there are no theoretical barriers to computing near-optimal policies in arbitrarily large, unstructured MDPs.

1 Introduction

In the past decade, Markov decision processes (MDPs) and reinforcement learning have become a standard framework for planning and learning under uncertainty within the artificial intelligence literature. The desire to attack problems of increasing complexity with this formalism has recently led researchers to focus particular attention on the case of (exponentially or even infinitely) large state spaces. A number of interesting algorithmic and representational suggestions have been made for coping with such large MDPs. Function approximation [SB98] is a well-studied approach to learning value functions in large state spaces, and many authors have recently begun to study the properties of large MDPs that enjoy compact representations, such as MDPs in which the state transition probabilities factor into a small number of components [MHK+98].
In this paper, we are interested in the problem of computing a near-optimal policy in a large or infinite MDP that is given; that is, we are interested in planning. It should be clear that as an MDP becomes very large, the classical planning assumption that the MDP is given explicitly by tables of rewards and transition probabilities becomes infeasible. One approach to this difficulty is to assume that the MDP has some special structure that permits compact representation (such as the factored transition probabilities mentioned above), and to design special-purpose planning algorithms that exploit this structure. Here we take a rather different approach. We consider a setting in which our planning algorithm is given access to a generative model, or simulator, of the MDP. Informally, this is a "black box" to which we can give any state-action pair (s, a), and receive in return a randomly sampled next state and reward from the distributions associated with (s, a). Generative models are a natural way in which a large MDP might be specified, and are more general than most structured representations, in the sense that structured representations usually provide an efficient way of implementing a generative model. Note also that since a generative model provides less information than explicit tables of probabilities, but more information than a single continuous trajectory of experience generated according to some exploration policy, results obtained via a generative model blur the distinction between what is typically called "planning" and "learning" in MDPs. Our main result is a new algorithm that accesses the given generative model to perform near-optimal planning in an "on-line" fashion. From any given state s, the algorithm samples the generative model for many different state-action pairs, and uses these samples to compute a near-optimal action from s.
The amount of time required to compute a near-optimal action from any particular state s has no dependence on the number of states in the MDP, even though the next-state distributions from s may of course be spread over the entire state space. The key to our analysis is in showing that appropriate sparse sampling suffices to construct enough information about the environment near s to compute a near-optimal action. The analysis relies on a combination of Bellman equation calculations, which are standard in reinforcement learning, and uniform convergence arguments, which are standard in supervised learning; this combination of techniques was first applied in [KS99]. As mentioned, the running time required at each state does have an exponential dependence on the horizon time (which can be shown to be unavoidable without further assumptions). Note that this learning algorithm is itself simply a (stochastic) policy that happens to use a generative model as a subroutine. In this sense, if we view the generative model as providing a "compact" representation of the MDP, our algorithm provides a correspondingly "compact" representation of a near-optimal policy. We view our result as complementary to work that proposes and exploits particular compact representations of MDPs [MHK+98], with both lines of work beginning to demonstrate the potential feasibility of planning and learning in very large environments.

2 Preliminaries

We begin with the definition of a Markov decision process on a set of N = |S| states, explicitly allowing the possibility of the number of states being (countably or uncountably) infinite.

Definition 1 A Markov decision process M on a set of states S and with actions {a_1, ..., a_k} consists of:

Transition Probabilities: For each state-action pair (s, a), a next-state distribution P_sa(s') that specifies the probability of transition to each state s' upon execution of action a from state s.

Reward Distributions: For each state-action pair (s, a), a distribution R_sa on real-valued rewards for executing action a from state s. We assume rewards are bounded in absolute value by R_max.

For simplicity, we shall assume in this paper that all rewards are in fact deterministic. However, all of our results have easy generalizations for the case of stochastic rewards, with an appropriate and necessary dependence on the variance of the reward distributions.

Definition 2 A generative model for a Markov decision process M is a randomized algorithm that, on input of a state-action pair (s, a), outputs R_sa and a state s', where s' is randomly drawn according to the transition probabilities P_sa(·).
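The black box of Definition 2 can be made concrete in a few lines of Python. The class names and the toy two-state chain MDP below are our own illustration, not from the paper, but the interface, a single call mapping (s, a) to a sampled next state and the reward R_sa, is exactly what the planner will need:

```python
import random

class GenerativeModel:
    """Black box of Definition 2: given (s, a), return a next state
    s' sampled from P_sa(.) together with the (deterministic) reward R_sa."""
    def sample(self, s, a):
        raise NotImplementedError

class TwoStateChain(GenerativeModel):
    # Toy MDP with states 0 and 1: action 1 tends to switch states,
    # action 0 tends to stay; reward 1 is earned only in state 1.
    def sample(self, s, a):
        move_prob = 0.9 if a == 1 else 0.1
        s_next = 1 - s if random.random() < move_prob else s
        reward = 1.0 if s == 1 else 0.0   # R_sa depends only on s here
        return s_next, reward

model = TwoStateChain()
s_next, r = model.sample(0, 1)   # one call: sampled next state and reward
```

Note that the planner never inspects P_sa directly; it only draws samples through this one method, which is why structured representations that support efficient simulation automatically fit this setting.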
Following standard terminology in reinforcement learning, we define a (stochastic) policy to be any mapping π : S → {a_1, ..., a_k}. Thus π(s) may be a random variable, but depends only on the current state s. We will be primarily concerned with discounted reinforcement learning (most of our results have straightforward generalizations to the undiscounted finite-horizon case for any fixed horizon H), so we assume we are given a number 0 ≤ γ < 1 called the discount factor, with which we then define the value function V^π for any policy π:

    V^π(s) = E[ Σ_{i=1}^∞ γ^{i−1} r_i | s, π ]    (1)

where r_i is the reward received on the i-th step of executing the policy π from state s, and the expectation is over the transition probabilities and any randomization in π. Note that for any s and any π, |V^π(s)| ≤ V_max, where we define V_max = R_max/(1 − γ). We also define the Q-function for a given policy π as

    Q^π(s, a) = R_sa + γ E_{s'∼P_sa(·)}[V^π(s')]    (2)

(where the notation s' ∼ P_sa(·) means that s' is drawn according to the distribution P_sa(·)). We will later describe an algorithm A that takes as input any state s and (stochastically) outputs an action a, and which therefore implements a policy. When we have such an algorithm, we will also write V^A and Q^A to denote the value function and Q-function of the policy implemented by A. Finally, we define the optimal value function and the optimal Q-function as V*(s) = sup_π V^π(s) and Q*(s, a) = sup_π Q^π(s, a), and the optimal policy π*, π*(s) = arg max_a Q*(s, a) for all s ∈ S.

3 Planning in Large or Infinite MDPs

Usually one considers the planning problem in MDPs to be that of computing a near-optimal policy, given as input the transition probabilities P_sa(·) and the rewards R_sa (for instance, by solving the MDP for the optimal policy). Thus, the input is a complete and exact model, and the output is a total mapping from states to actions.
Without additional assumptions about the structure of the MDP, such an approach is clearly infeasible in very large state spaces, where even reading all of the input can take time on the order of N, and even specifying a general policy requires space on the order of N. In such MDPs, a more fruitful way of thinking about planning might be an on-line view, in which we examine the per-state complexity of planning. Thus, the input to a planning algorithm would be a single state, and the output would be which single action to take from that state. In this on-line view, a planning algorithm is itself simply a policy (but one that may need to perform some nontrivial computation at each state). Our main result is the description and analysis of an algorithm A that, given access to a generative model for an arbitrary MDP M, takes any state of M as input and produces an action as output, and meets the following performance criteria:

The policy implemented by A is near-optimal in M;

The running time of A (that is, the time required to compute an action at any state) has no dependence on the number of states of M.

This result is obtained under the assumption that the input state to A requires only O(1) space, a standard assumption known as the uniform cost model [AHU74] that is typically adopted to allow analysis of algorithms that operate on real numbers (such as we require to allow infinite state spaces). If one is unhappy with this model, then algorithm A will suffer a dependence on the number of states only equal to the space required to name the states (at worst log(N) for N states).

3.1 A Sparse Sampling Planner

Here is our main result:

Theorem 1 There is a randomized algorithm A that, given access to a generative model for any MDP M, takes as input any state s ∈ S and any value ε > 0, outputs an action, and satisfies the following two conditions:

(Efficiency) The running time of A is O((kC)^H), where

    C = (V_max²/λ²) ( 2H log(kHV_max²/λ²) + log(R_max/λ) ),
    H = ⌈ log_γ(λ/V_max) ⌉,
    λ = ε(1 − γ)²/4,
    V_max = R_max/(1 − γ).

In particular, the running time depends only on R_max, γ, and ε, and does not depend on N = |S|. If we view R_max as a constant, this can also be written

    ( k/(ε(1 − γ)) )^{ O( (1/(1−γ)) log(1/(ε(1−γ))) ) }.    (3)

(Near-Optimality) The value function of the stochastic policy implemented by A satisfies

    |V^A(s) − V*(s)| ≤ ε    (4)

simultaneously for all states s ∈ S.

As we have already suggested, it will be helpful to think of algorithm A in two different ways. On the one hand, A is an algorithm that takes a state as input and has access to a generative model, and as such we shall be interested in its resource complexity: its running time, and the number of calls it needs to make to the generative model (both per state input). On the other hand, A produces an action as output in response to each state given as input, and thus implements a (possibly stochastic) policy. While a sketch of the proof of Theorem 1 is given in Appendix A, and detailed pseudo-code for the algorithm is provided in Figure 1, we now give some high-level intuition for the algorithm and its analysis. For the sake of simplicity, let us consider only the two-action case here, with actions a_1 and a_2. Recall that the optimal policy at s is given by π*(s) = arg max_a Q*(s, a), and therefore is completely determined by, and easily calculated from, Q*(s, ·). Estimating the Q-values is a common way of planning in MDPs, and the basic idea of our algorithm is to find good estimates of Q*(s, a) for all actions a by looking only within a small neighborhood of s.
In particular, for our algorithm to run in time that does not depend on N = |S|, it is critical that the size of this neighborhood does not depend on N, even though, for example, s may have very diffuse transition probabilities, so that it is possible to reach any other state in S from s. From the standard duality between Q-functions and value functions, the task of estimating Q-functions is very similar to that of estimating value functions. So while the algorithm uses the Q-function, we will, purely for expository purposes, actually describe here how we estimate V*(s). There are two parts to the approximation we use. First, rather than estimating V*, we will actually estimate, for a value of H to be specified later, the h-step expected discounted reward

    V*_h(s) = E[ Σ_{i=1}^h γ^{i−1} r_i | s, π* ]    (6)

where r_i is the reward received on the i-th time step upon executing the optimal policy π* from s. Note the 0-step expected discounted reward is easy to estimate: since V*_0(s) = 0, we may simply pick our 0-step estimates to be V̂_0(s) = 0. Moreover, we see that the V*_h(s), for h ≥ 1, are recursively given by

    V*_h(s) = R_{sa*} + γ E_{s'∼P_{sa*}(·)}[V*_{h−1}(s')]
            ≈ max_a { R_sa + γ E_{s'∼P_sa(·)}[V*_{h−1}(s')] }    (7)

where a* is the action taken by the optimal policy from state s. The quality of the approximation in Equation (7) becomes better for larger values of h, and is controllably tight for the largest value h = H we eventually choose. One of the main efforts in the proof is establishing that the error incurred by the recursive application of this approximation can be made controllably small by choosing H sufficiently large. Thus, if we are able to obtain an estimate V̂_{h−1}(s') of V*_{h−1}(s') for any s', we can inductively define an algorithm for finding an estimate V̂_h(s) of V*_h(s) by making use of Equation (7). Our algorithm will approximate the expectation in Equation (7) by a sample of C random next states from the generative model, where C is a parameter to be determined (and which, for reasons that will become clear later, we call the "width").
Recursively, given a way of finding the estimator V̂_{h−1}(s') for any s', we find our estimate V̂_h(s) of V*_h(s) as follows:

1. For each action a, use the generative model to get R_sa and to sample a set S_a of C independently sampled states from the next-state distribution P_sa(·).

2. Use our procedure for finding V̂_{h−1} to estimate V̂_{h−1}(s') for each state s' in any of the sets S_a.

3. Following Equation (7), our estimate of V*_h(s) is then given by

    V̂_h(s) = max_a { R_sa + γ (1/C) Σ_{s'∈S_a} V̂_{h−1}(s') }.    (8)

We have described our algorithm "bottom up," but it is also informative to view it "top down." Our algorithm is essentially building a sparse look-ahead tree. Figure 2 shows a conceptual picture of this tree for a run of the algorithm from an input state s_0, for C = 3. (C will typically be much larger.) From the root s_0, we try action a_1 three times and action a_2 three times. From each of

Function: EstimateQ(h, C, γ, G, s)
Input: depth h, width C, discount γ, generative model G, state s.
Output: A list (Q̂_h(s, a_1), Q̂_h(s, a_2), ..., Q̂_h(s, a_k)) of estimates of the Q*_h(s, a_i).
1. If h = 0, return (0, ..., 0).
2. For each a ∈ A, use G to generate C samples from the next-state distribution P_sa(·). Let S_a be a set containing these C next states.
3. For each a ∈ A, let our estimate of Q*_h(s, a) be

    Q̂_h(s, a) = R(s, a) + γ (1/C) Σ_{s'∈S_a} EstimateV(h − 1, C, γ, G, s').    (5)

4. Return (Q̂_h(s, a_1), Q̂_h(s, a_2), ..., Q̂_h(s, a_k)).

Function: EstimateV(h, C, γ, G, s)
Input: depth h, width C, discount γ, generative model G, state s.
Output: A number V̂_h(s) that is an estimate of V*_h(s).
1. Let (Q̂_h(s, a_1), Q̂_h(s, a_2), ..., Q̂_h(s, a_k)) := EstimateQ(h, C, γ, G, s).
2. Return max_{a∈{a_1,...,a_k}} Q̂_h(s, a).

Function: Algorithm A(ε, γ, R_max, G, s_0)
Input: tolerance ε, discount γ, max reward R_max, generative model G, state s_0.
Output: An action a.
1. Let the required horizon H and width C parameters be calculated as functions of ε, γ and R_max as given in Theorem 1.
2. Let (Q̂_H(s_0, a_1), Q̂_H(s_0, a_2), ..., Q̂_H(s_0, a_k)) := EstimateQ(H, C, γ, G, s_0).
3. Return arg max_{a∈{a_1,...,a_k}} Q̂_H(s_0, a).

Figure 1: Algorithm A for planning in large or infinite state spaces. EstimateV finds the V̂_h described in the text, and EstimateQ finds the analogously defined Q̂_h. Algorithm A implements the policy.

the resulting states, we also try each action C times, and so on down to depth H in the tree. Zero values assigned to the leaves then correspond to our estimates of V̂_0, which are "backed up" to find estimates of V̂_1 for their parents, which are in turn backed up to their parents, and so on, up to the root to find an estimate of V̂_H(s_0). To complete the description of the algorithm, all that remains is to choose the depth H and the parameter C, which controls the width of the tree. Bounding the required depth H is the easy and standard part.
It is not hard to see that if we choose depth H = log_γ(ε(1 − γ)/R_max) (the so-called ε-horizon time), then the discounted sum of the rewards that is obtained by considering rewards beyond this horizon is bounded by ε. However, such a tree may still be as large as M itself, depending on the choice of C. For instance, if the next-state distribution from s is uniform or nearly uniform over all the states in M, then it would naively seem that, in order to approximate the next-state distributions well, we would need to take at least C = O(N) samples, if only to make sure we see most of the possible next states at least once in our samples. The central claim we establish about C is that it can be chosen independent of the number of states in M, yet still result in choosing near-optimal actions at the root. The key to the argument is that even though small samples may give very poor approximations to the next-state distribution at each state in the tree, they will, nevertheless, give good estimates of the expectation terms of Equation (7), and that is really all we need. For this we apply a careful combination of uniform convergence methods and inductive arguments on the tree depth. Again, the technical details of the proof of Theorem 1 are sketched in Appendix A. The resulting tree thus represents only a vanishing fraction of all of the H-step paths starting from s_0 that have non-zero probability in the MDP; that is, the sparse look-ahead tree covers only a vanishing part of the full look-ahead tree. In this sense, our algorithm is clearly related to and inspired by classical look-ahead search techniques [RN95]; our main contribution is in showing that in very large stochastic environments, clever random sampling suffices to reconstruct nearly all of the information available in the (exponentially or infinitely) large full look-ahead tree.
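The ε-horizon time just described is a one-line computation; the sketch below (numbers chosen purely for illustration, not taken from the paper) shows how quickly H, and hence the (kC)^H tree size, grows as γ approaches 1:

```python
import math

def eps_horizon(gamma, eps, r_max):
    # Smallest H with gamma^H * R_max / (1 - gamma) <= eps,
    # i.e. H = ceil(log_gamma(eps * (1 - gamma) / R_max)).
    return math.ceil(math.log(eps * (1 - gamma) / r_max, gamma))

H = eps_horizon(gamma=0.9, eps=0.1, r_max=1.0)
# The sparse tree then has on the order of (kC)^H nodes:
# independent of |S|, but exponential in H.
```

For gamma = 0.9 this already gives a horizon of several dozen steps, which is why the paper's later discussion of memoization and iterative deepening matters in practice.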
Note that in the case of deterministic environments, where from each state-action pair we can reach only a single next state, the sparse and full trees coincide (assuming a memoization trick described below), and our algorithm reduces to classical deterministic look-ahead search.
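The recursion of Figure 1 can be sketched directly in Python. This is our own illustrative translation, not the paper's code: `model.sample`, the `Chain` toy MDP, and the small H and C in the usage example are assumptions for demonstration, and a faithful implementation would set H and C as dictated by Theorem 1 (rewards are assumed deterministic, as in the paper, and C ≥ 1):

```python
def estimate_q(h, C, gamma, model, actions, s):
    """EstimateQ: dict {a: Qhat_h(s, a)} from C sampled next states per action."""
    if h == 0:
        return {a: 0.0 for a in actions}
    q = {}
    for a in actions:
        total = 0.0
        for _ in range(C):
            s_next, r_sa = model.sample(s, a)   # deterministic reward, C >= 1
            total += estimate_v(h - 1, C, gamma, model, actions, s_next)
        q[a] = r_sa + gamma * total / C         # Equation (5)
    return q

def estimate_v(h, C, gamma, model, actions, s):
    """EstimateV: Vhat_h(s) = max_a Qhat_h(s, a)."""
    return max(estimate_q(h, C, gamma, model, actions, s).values())

def algorithm_a(H, C, gamma, model, actions, s0):
    """Algorithm A: the greedy action at s0 under the depth-H estimates."""
    q = estimate_q(H, C, gamma, model, actions, s0)
    return max(actions, key=lambda a: q[a])

class Chain:
    # Toy deterministic simulator: action 1 moves to the rewarding state 1,
    # action 0 stays put; state 1 is absorbing with reward 1.
    def sample(self, s, a):
        s_next = 1 if a == 1 else s
        return s_next, (1.0 if s == 1 else 0.0)

a = algorithm_a(H=3, C=2, gamma=0.9, model=Chain(), actions=(0, 1), s0=0)
```

On this toy chain the depth-3 estimates already prefer action 1 at state 0, since only that action reaches the rewarding state within the horizon.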

Figure 2: Sparse look-ahead tree of states constructed by the algorithm. (Shown with C = 3 and actions a_1, a_2, down to depth H.)

3.2 Practical Issues and Lower Bounds

Even though the running time of algorithm A does not depend on the size of the MDP, it still runs in time exponential in the ε-horizon time H, and therefore exponential in 1/(1 − γ). It would seem that the algorithm would be practical only if γ is not too close to 1. Nevertheless, there are a couple of simple tricks that may help to reduce the running time in certain cases. The first idea is simply to use memoization in our subroutines for calculating the V̂_h(s)'s. In Figure 2, this means that whenever there are two nodes at the same level of the tree that correspond to the same state, we collapse them into one node (keeping just one of their subtrees). While it is straightforward to show the correctness of such memoization procedures for deterministic procedures, one should be careful when addressing randomized procedures; we can show that the properties of the algorithm are maintained under this optimization (details are deferred to the full version of the paper). In implementing the algorithm, one may also wish not to specify a target ε in advance, but rather just to do as well as is possible with the computational resources available, in which case an "iterative deepening" approach may be taken. In our case, this would entail simultaneously increasing C and H by decreasing the target ε. Also, as studied in Davies et al. [DNM98], if we have access to an initial estimate of the value function, we can replace our estimates V̂_0(s) = 0 at the leaves with the estimated value function at those states. Though we shall not do so here, it is again easy to make formal performance guarantees depending on C, H and the supremum error of the value function estimate we are using.
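The memoization trick can be sketched as a cache keyed on (depth, state), so that duplicate nodes at the same level share a single subtree. The function names and the toy `Loop` MDP below are our own illustration of the idea, not the paper's implementation:

```python
def make_memoized_estimate_v(C, gamma, model, actions):
    """Collapse duplicate (depth, state) nodes: each pair is evaluated once,
    so repeated states at the same tree level share one subtree."""
    memo = {}
    def estimate_v(h, s):
        if h == 0:
            return 0.0
        if (h, s) in memo:
            return memo[(h, s)]
        best = float("-inf")
        for a in actions:
            total, r_sa = 0.0, 0.0
            for _ in range(C):
                s_next, r_sa = model.sample(s, a)  # deterministic reward
                total += estimate_v(h - 1, s_next)
            best = max(best, r_sa + gamma * total / C)
        memo[(h, s)] = best
        return best
    return estimate_v

class Loop:
    # Toy simulator: every action stays in place with reward 1.
    def sample(self, s, a):
        return s, 1.0

v = make_memoized_estimate_v(C=2, gamma=0.5, model=Loop(), actions=(0, 1))(3, 0)
# Here the depth-3 estimate is the truncated sum 1 + 0.5 + 0.25 = 1.75,
# computed with only O(H) distinct (depth, state) evaluations.
```

In deterministic or small-state environments this collapses the exponentially large tree to at most H·|S| distinct evaluations, which is exactly why the sparse and full trees coincide in the deterministic case.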
Unfortunately, despite these tricks, it is not difficult to prove a lower bound that shows that any planning algorithm with access only to a generative model, and which implements a policy that is ε-close to optimal in a general MDP, must have running time at least exponential in the ε-horizon time.

4 Summary and Related Work

We have described an algorithm for near-optimal planning from a generative model that has a per-state running time that does not depend on the size of the state space, but which is still exponential in the ε-horizon time. Two interesting directions for improvement are to allow partially observable MDPs, and to find more efficient algorithms that do not have exponential dependence on the horizon time. As a first step towards both of these goals, in a separate paper we investigate a framework in which the goal is to use a generative model to find a near-best strategy within a restricted class of strategies for a POMDP. Typical examples of such restricted strategy classes include limited-memory strategies in POMDPs, or policies in large MDPs that implement a linear mapping from state vectors to actions. Our main result in this framework says that as long as the restricted class of strategies is not too "complex" (where this is formalized using appropriate generalizations of standard notions like VC dimension from supervised learning), then it is possible to find a near-best strategy from within the class, in time that again has no dependence on the size of the state space. If the restricted class of strategies is smoothly parameterized, then this further leads to a number of fast, practical algorithms for doing gradient descent to find the near-best strategy within the class, where the running time of each gradient descent step now has only linear rather than exponential dependence on the horizon time.

References

[AHU74] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, 1974.

[DNM98] Scott Davies, Andrew Y. Ng, and Andrew Moore. Applying online-search to reinforcement learning.
In Proceedings of AAAI-98, pages 753-760. AAAI Press, 1998.

[KS99] Michael Kearns and Satinder Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In Neural Information Processing Systems 11. MIT Press, 1999 (to appear).

[MHK+98] N. Meuleau, M. Hauskrecht, K.-E. Kim, L. Peshkin, L. P. Kaelbling, T. Dean, and C. Boutilier. Solving very large weakly coupled Markov decision processes. In Proceedings of AAAI-98, pages 165-172. AAAI Press, 1998.

[RN95] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 1995.

[SB98] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[SY94] Satinder Singh and Richard Yee. An upper bound on the loss from approximate optimal-value functions. Machine Learning, 16:227-233, 1994.

Appendix A: Proof Sketch of Theorem 1

In this appendix, we sketch the proof of Theorem 1. Throughout the analysis we will rely on the pseudo-code provided for algorithm A given in Figure 1.

The claim on the running time is immediate from the definition of algorithm A. Each call to EstimateQ generates kC calls to EstimateV, C calls for each action.

Each recursive call also reduces the depth parameter by one, so the depth of the recursion is at most H. Therefore the running time is O((kC)^H).

The main effort is in showing that the values returned by EstimateQ are indeed good estimates of Q* for the chosen values of C and H. There are two sources of inaccuracy in these estimates. The first is that we use only a finite sample to approximate an expectation: we draw only C states from the next-state distributions. The second source of inaccuracy is that in computing EstimateQ, we are not actually using the values of V*(·) but rather values returned by EstimateV, which are themselves only estimates. The crucial step in the proof is to show that as n increases, the overall inaccuracy decreases.

Let us first define an intermediate random variable that will capture the inaccuracy due to the limited sampling. Define U*(s, a) as follows:

    U*(s, a) = R_sa + γ (1/C) Σ_{i=1}^C V*(s_i)    (9)

where the s_i are drawn according to P_sa(·). Note that U*(s, a) is averaging values of V*(·), the unknown optimal value function. Since U*(s, a) is used only for the proof and not in the algorithm, there is no problem in defining it this way. The next lemma (proof omitted) shows that with high probability, the difference between U*(s, a) and Q*(s, a) is at most λ.

Lemma 2 For any state s and action a, with probability at least 1 − e^{−λ²C/V_max²} we have

    |Q*(s, a) − U*(s, a)| = γ | E_{s'∼P_sa(·)}[V*(s')] − (1/C) Σ_i V*(s_i) | ≤ λ,

where the probability is taken over the draw of the s_i from P_sa(·).

Now that we have quantified the error due to finite sampling, we can bound the error that results from using the values returned by EstimateV rather than V*(·). We bound this error as the difference between U*(s, a) and EstimateV. In order to make our notation simpler, let V^n(s) be the value returned by EstimateV(n, C, γ, G, s), and let Q^n(s, a) be the component in the output of EstimateQ(n, C, γ, G, s) that corresponds to action a.
Using this notation, our algorithm computes

    Q^n(s, a) = R_sa + γ (1/C) Σ_{i=1}^C V^{n−1}(s_i)    (10)

where V^{n−1}(s) = max_a { Q^{n−1}(s, a) }, and Q^0(s, a) = 0 for every state s and action a. We now define a parameter α_n that will eventually bound the difference between Q*(s, a) and Q^n(s, a). We define α_n recursively:

    α_{n+1} = γ(λ + α_n)    (11)

where α_0 = V_max. Solving for α_H we obtain

    α_H = Σ_{i=1}^H γ^i λ + γ^H V_max ≤ λ/(1 − γ) + γ^H V_max.    (12)

The next lemma (proof omitted) bounds the error in the estimation, at level n, by α_n. Intuitively, the error due to finite sampling contributes λ, while the errors in estimation contribute α_n. The combined error is λ + α_n, but since we are discounting, the effective error is only γ(λ + α_n), which by definition is α_{n+1}.

Lemma 3 With probability at least 1 − (kC)^n e^{−λ²C/V_max²} we have that

    |Q*(s, a) − Q^n(s, a)| ≤ α_n.    (13)

From α_H ≤ γ^H V_max + λ/(1 − γ), we also see that for H = log_γ(λ/V_max), with probability 1 − (kC)^H e^{−λ²C/V_max²} all the final estimates Q^H(s_0, a) are within 2λ/(1 − γ) of the true Q-values. The next step is to choose C such that δ = λ/R_max bounds (kC)^H e^{−λ²C/V_max²}, the probability of a bad estimate during the entire computation. Specifically,

    C = (V_max²/λ²) ( 2H log(kHV_max²/λ²) + log(1/δ) )    (14)

is sufficient to ensure that with probability 1 − δ all the estimates are accurate.

At this point we have shown that with high probability, algorithm A computes a good estimate of Q*(s_0, a) for all a, where s_0 is the input state. To complete the proof, we need to relate this to the expected value of a stochastic policy. We give a fairly general result about MDPs, which does not depend on our specific algorithm. (A similar result appears in [SY94].)

Lemma 4 Assume that π is a stochastic policy, so that π(s) is a random variable. If, for each state s, the probability that Q*(s, π*(s)) − Q*(s, π(s)) < β is at least 1 − δ, then the discounted infinite horizon return of π is at most (β + 2δV_max)/(1 − γ) from the optimal return; that is, for any state s, V*(s) − V^π(s) ≤ (β + 2δV_max)/(1 − γ).

Now we can combine all the lemmas to prove our main theorem.
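Unrolling the recursion in Equation (11) makes the bound in Equation (12) explicit (a routine verification, using α_0 = V_max and the geometric series):

```latex
\alpha_{n+1} = \gamma(\lambda + \alpha_n), \quad \alpha_0 = V_{\max}
\;\Longrightarrow\;
\alpha_H = \sum_{i=1}^{H} \gamma^i \lambda + \gamma^H V_{\max}
\;\le\; \frac{\lambda}{1-\gamma} + \gamma^H V_{\max},
```

and for H = log_γ(λ/V_max) the last term is at most λ, giving α_H ≤ λ/(1 − γ) + λ ≤ 2λ/(1 − γ).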
Proof of Theorem 1: As discussed before, the running time is immediate from the algorithm, and the main work is in showing that we compute a near-optimal policy. By Lemma 3 we have that the error in the estimation of Q* is at most α_H, with probability 1 − (kC)^H e^{−λ²C/V_max²}. Using the values we chose for C and H, we have that with probability 1 − δ the error is at most 2λ/(1 − γ). By Lemma 4 this implies that such a policy π has the property that, from every state s,

    V*(s) − V^π(s) ≤ ( 2λ/(1 − γ) + 2δV_max ) / (1 − γ).    (15)

Substituting back the values of δ = λ/R_max and λ = ε(1 − γ)²/4 that we had chosen, it follows that

    V*(s) − V^π(s) ≤ 4λ/(1 − γ)² = ε.    (16)
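The substitution in the last step can be checked term by term: with δ = λ/R_max and V_max = R_max/(1 − γ) we have δV_max = λ/(1 − γ), so

```latex
\frac{1}{1-\gamma}\left(\frac{2\lambda}{1-\gamma} + 2\delta V_{\max}\right)
= \frac{1}{1-\gamma}\cdot\frac{4\lambda}{1-\gamma}
= \frac{4\lambda}{(1-\gamma)^2}
= \frac{4}{(1-\gamma)^2}\cdot\frac{\varepsilon(1-\gamma)^2}{4}
= \varepsilon.
```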


More information

Quantum Numbers and Rules

Quantum Numbers and Rules OpenStax-CNX module: m42614 1 Quantum Numbers and Rules OpenStax College Tis work is produced by OpenStax-CNX and licensed under te Creative Commons Attribution License 3.0 Abstract Dene quantum number.

More information

7.1 Using Antiderivatives to find Area

7.1 Using Antiderivatives to find Area 7.1 Using Antiderivatives to find Area Introduction finding te area under te grap of a nonnegative, continuous function f In tis section a formula is obtained for finding te area of te region bounded between

More information

arxiv: v3 [cs.ds] 4 Aug 2017

arxiv: v3 [cs.ds] 4 Aug 2017 Non-preemptive Sceduling in a Smart Grid Model and its Implications on Macine Minimization Fu-Hong Liu 1, Hsiang-Hsuan Liu 1,2, and Prudence W.H. Wong 2 1 Department of Computer Science, National Tsing

More information

Preface. Here are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed.

Preface. Here are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed. Preface Here are my online notes for my course tat I teac ere at Lamar University. Despite te fact tat tese are my class notes, tey sould be accessible to anyone wanting to learn or needing a refreser

More information

Continuity and Differentiability Worksheet

Continuity and Differentiability Worksheet Continuity and Differentiability Workseet (Be sure tat you can also do te grapical eercises from te tet- Tese were not included below! Typical problems are like problems -3, p. 6; -3, p. 7; 33-34, p. 7;

More information

Consider a function f we ll specify which assumptions we need to make about it in a minute. Let us reformulate the integral. 1 f(x) dx.

Consider a function f we ll specify which assumptions we need to make about it in a minute. Let us reformulate the integral. 1 f(x) dx. Capter 2 Integrals as sums and derivatives as differences We now switc to te simplest metods for integrating or differentiating a function from its function samples. A careful study of Taylor expansions

More information

What we learned last time

What we learned last time Wat we learned last time Value-function approximation by stocastic gradient descent enables RL to be applied to arbitrarily large state spaces Most algoritms just carry over Targets from tabular case Wit

More information

Mathematics 5 Worksheet 11 Geometry, Tangency, and the Derivative

Mathematics 5 Worksheet 11 Geometry, Tangency, and the Derivative Matematics 5 Workseet 11 Geometry, Tangency, and te Derivative Problem 1. Find te equation of a line wit slope m tat intersects te point (3, 9). Solution. Te equation for a line passing troug a point (x

More information

Polynomial Interpolation

Polynomial Interpolation Capter 4 Polynomial Interpolation In tis capter, we consider te important problem of approximating a function f(x, wose values at a set of distinct points x, x, x 2,,x n are known, by a polynomial P (x

More information

Near-Optimal conversion of Hardness into Pseudo-Randomness

Near-Optimal conversion of Hardness into Pseudo-Randomness Near-Optimal conversion of Hardness into Pseudo-Randomness Russell Impagliazzo Computer Science and Engineering UC, San Diego 9500 Gilman Drive La Jolla, CA 92093-0114 russell@cs.ucsd.edu Ronen Saltiel

More information

Cubic Functions: Local Analysis

Cubic Functions: Local Analysis Cubic function cubing coefficient Capter 13 Cubic Functions: Local Analysis Input-Output Pairs, 378 Normalized Input-Output Rule, 380 Local I-O Rule Near, 382 Local Grap Near, 384 Types of Local Graps

More information

The Complexity of Computing the MCD-Estimator

The Complexity of Computing the MCD-Estimator Te Complexity of Computing te MCD-Estimator Torsten Bernolt Lerstul Informatik 2 Universität Dortmund, Germany torstenbernolt@uni-dortmundde Paul Fiscer IMM, Danisc Tecnical University Kongens Lyngby,

More information

Order of Accuracy. ũ h u Ch p, (1)

Order of Accuracy. ũ h u Ch p, (1) Order of Accuracy 1 Terminology We consider a numerical approximation of an exact value u. Te approximation depends on a small parameter, wic can be for instance te grid size or time step in a numerical

More information

Complexity of Decoding Positive-Rate Reed-Solomon Codes

Complexity of Decoding Positive-Rate Reed-Solomon Codes Complexity of Decoding Positive-Rate Reed-Solomon Codes Qi Ceng 1 and Daqing Wan 1 Scool of Computer Science Te University of Oklaoma Norman, OK73019 Email: qceng@cs.ou.edu Department of Matematics University

More information

Math 312 Lecture Notes Modeling

Math 312 Lecture Notes Modeling Mat 3 Lecture Notes Modeling Warren Weckesser Department of Matematics Colgate University 5 7 January 006 Classifying Matematical Models An Example We consider te following scenario. During a storm, a

More information

Pre-Calculus Review Preemptive Strike

Pre-Calculus Review Preemptive Strike Pre-Calculus Review Preemptive Strike Attaced are some notes and one assignment wit tree parts. Tese are due on te day tat we start te pre-calculus review. I strongly suggest reading troug te notes torougly

More information

Polynomial Interpolation

Polynomial Interpolation Capter 4 Polynomial Interpolation In tis capter, we consider te important problem of approximatinga function fx, wose values at a set of distinct points x, x, x,, x n are known, by a polynomial P x suc

More information

MVT and Rolle s Theorem

MVT and Rolle s Theorem AP Calculus CHAPTER 4 WORKSHEET APPLICATIONS OF DIFFERENTIATION MVT and Rolle s Teorem Name Seat # Date UNLESS INDICATED, DO NOT USE YOUR CALCULATOR FOR ANY OF THESE QUESTIONS In problems 1 and, state

More information

Financial Econometrics Prof. Massimo Guidolin

Financial Econometrics Prof. Massimo Guidolin CLEFIN A.A. 2010/2011 Financial Econometrics Prof. Massimo Guidolin A Quick Review of Basic Estimation Metods 1. Were te OLS World Ends... Consider two time series 1: = { 1 2 } and 1: = { 1 2 }. At tis

More information

NUMERICAL DIFFERENTIATION. James T. Smith San Francisco State University. In calculus classes, you compute derivatives algebraically: for example,

NUMERICAL DIFFERENTIATION. James T. Smith San Francisco State University. In calculus classes, you compute derivatives algebraically: for example, NUMERICAL DIFFERENTIATION James T Smit San Francisco State University In calculus classes, you compute derivatives algebraically: for example, f( x) = x + x f ( x) = x x Tis tecnique requires your knowing

More information

2.3 Product and Quotient Rules

2.3 Product and Quotient Rules .3. PRODUCT AND QUOTIENT RULES 75.3 Product and Quotient Rules.3.1 Product rule Suppose tat f and g are two di erentiable functions. Ten ( g (x)) 0 = f 0 (x) g (x) + g 0 (x) See.3.5 on page 77 for a proof.

More information

Notes on wavefunctions II: momentum wavefunctions

Notes on wavefunctions II: momentum wavefunctions Notes on wavefunctions II: momentum wavefunctions and uncertainty Te state of a particle at any time is described by a wavefunction ψ(x). Tese wavefunction must cange wit time, since we know tat particles

More information

Exam 1 Review Solutions

Exam 1 Review Solutions Exam Review Solutions Please also review te old quizzes, and be sure tat you understand te omework problems. General notes: () Always give an algebraic reason for your answer (graps are not sufficient),

More information

Lecture 15. Interpolation II. 2 Piecewise polynomial interpolation Hermite splines

Lecture 15. Interpolation II. 2 Piecewise polynomial interpolation Hermite splines Lecture 5 Interpolation II Introduction In te previous lecture we focused primarily on polynomial interpolation of a set of n points. A difficulty we observed is tat wen n is large, our polynomial as to

More information

Sin, Cos and All That

Sin, Cos and All That Sin, Cos and All Tat James K. Peterson Department of Biological Sciences and Department of Matematical Sciences Clemson University Marc 9, 2017 Outline Sin, Cos and all tat! A New Power Rule Derivatives

More information

Differentiation in higher dimensions

Differentiation in higher dimensions Capter 2 Differentiation in iger dimensions 2.1 Te Total Derivative Recall tat if f : R R is a 1-variable function, and a R, we say tat f is differentiable at x = a if and only if te ratio f(a+) f(a) tends

More information

Some Review Problems for First Midterm Mathematics 1300, Calculus 1

Some Review Problems for First Midterm Mathematics 1300, Calculus 1 Some Review Problems for First Midterm Matematics 00, Calculus. Consider te trigonometric function f(t) wose grap is sown below. Write down a possible formula for f(t). Tis function appears to be an odd,

More information

Polynomials 3: Powers of x 0 + h

Polynomials 3: Powers of x 0 + h near small binomial Capter 17 Polynomials 3: Powers of + Wile it is easy to compute wit powers of a counting-numerator, it is a lot more difficult to compute wit powers of a decimal-numerator. EXAMPLE

More information

Chapter 5 FINITE DIFFERENCE METHOD (FDM)

Chapter 5 FINITE DIFFERENCE METHOD (FDM) MEE7 Computer Modeling Tecniques in Engineering Capter 5 FINITE DIFFERENCE METHOD (FDM) 5. Introduction to FDM Te finite difference tecniques are based upon approximations wic permit replacing differential

More information

Taylor Series and the Mean Value Theorem of Derivatives

Taylor Series and the Mean Value Theorem of Derivatives 1 - Taylor Series and te Mean Value Teorem o Derivatives Te numerical solution o engineering and scientiic problems described by matematical models oten requires solving dierential equations. Dierential

More information

Function Composition and Chain Rules

Function Composition and Chain Rules Function Composition and s James K. Peterson Department of Biological Sciences and Department of Matematical Sciences Clemson University Marc 8, 2017 Outline 1 Function Composition and Continuity 2 Function

More information

1. Consider the trigonometric function f(t) whose graph is shown below. Write down a possible formula for f(t).

1. Consider the trigonometric function f(t) whose graph is shown below. Write down a possible formula for f(t). . Consider te trigonometric function f(t) wose grap is sown below. Write down a possible formula for f(t). Tis function appears to be an odd, periodic function tat as been sifted upwards, so we will use

More information

SECTION 1.10: DIFFERENCE QUOTIENTS LEARNING OBJECTIVES

SECTION 1.10: DIFFERENCE QUOTIENTS LEARNING OBJECTIVES (Section.0: Difference Quotients).0. SECTION.0: DIFFERENCE QUOTIENTS LEARNING OBJECTIVES Define average rate of cange (and average velocity) algebraically and grapically. Be able to identify, construct,

More information

ch (for some fixed positive number c) reaching c

ch (for some fixed positive number c) reaching c GSTF Journal of Matematics Statistics and Operations Researc (JMSOR) Vol. No. September 05 DOI 0.60/s4086-05-000-z Nonlinear Piecewise-defined Difference Equations wit Reciprocal and Cubic Terms Ramadan

More information

IEOR 165 Lecture 10 Distribution Estimation

IEOR 165 Lecture 10 Distribution Estimation IEOR 165 Lecture 10 Distribution Estimation 1 Motivating Problem Consider a situation were we ave iid data x i from some unknown distribution. One problem of interest is estimating te distribution tat

More information

2.11 That s So Derivative

2.11 That s So Derivative 2.11 Tat s So Derivative Introduction to Differential Calculus Just as one defines instantaneous velocity in terms of average velocity, we now define te instantaneous rate of cange of a function at a point

More information

Flavius Guiaş. X(t + h) = X(t) + F (X(s)) ds.

Flavius Guiaş. X(t + h) = X(t) + F (X(s)) ds. Numerical solvers for large systems of ordinary differential equations based on te stocastic direct simulation metod improved by te and Runge Kutta principles Flavius Guiaş Abstract We present a numerical

More information

3.4 Worksheet: Proof of the Chain Rule NAME

3.4 Worksheet: Proof of the Chain Rule NAME Mat 1170 3.4 Workseet: Proof of te Cain Rule NAME Te Cain Rule So far we are able to differentiate all types of functions. For example: polynomials, rational, root, and trigonometric functions. We are

More information

Teaching Differentiation: A Rare Case for the Problem of the Slope of the Tangent Line

Teaching Differentiation: A Rare Case for the Problem of the Slope of the Tangent Line Teacing Differentiation: A Rare Case for te Problem of te Slope of te Tangent Line arxiv:1805.00343v1 [mat.ho] 29 Apr 2018 Roman Kvasov Department of Matematics University of Puerto Rico at Aguadilla Aguadilla,

More information

Recall from our discussion of continuity in lecture a function is continuous at a point x = a if and only if

Recall from our discussion of continuity in lecture a function is continuous at a point x = a if and only if Computational Aspects of its. Keeping te simple simple. Recall by elementary functions we mean :Polynomials (including linear and quadratic equations) Eponentials Logaritms Trig Functions Rational Functions

More information

REVIEW LAB ANSWER KEY

REVIEW LAB ANSWER KEY REVIEW LAB ANSWER KEY. Witout using SN, find te derivative of eac of te following (you do not need to simplify your answers): a. f x 3x 3 5x x 6 f x 3 3x 5 x 0 b. g x 4 x x x notice te trick ere! x x g

More information

Time (hours) Morphine sulfate (mg)

Time (hours) Morphine sulfate (mg) Mat Xa Fall 2002 Review Notes Limits and Definition of Derivative Important Information: 1 According to te most recent information from te Registrar, te Xa final exam will be eld from 9:15 am to 12:15

More information

A Reconsideration of Matter Waves

A Reconsideration of Matter Waves A Reconsideration of Matter Waves by Roger Ellman Abstract Matter waves were discovered in te early 20t century from teir wavelengt, predicted by DeBroglie, Planck's constant divided by te particle's momentum,

More information

MAT244 - Ordinary Di erential Equations - Summer 2016 Assignment 2 Due: July 20, 2016

MAT244 - Ordinary Di erential Equations - Summer 2016 Assignment 2 Due: July 20, 2016 MAT244 - Ordinary Di erential Equations - Summer 206 Assignment 2 Due: July 20, 206 Full Name: Student #: Last First Indicate wic Tutorial Section you attend by filling in te appropriate circle: Tut 0

More information

2.1 THE DEFINITION OF DERIVATIVE

2.1 THE DEFINITION OF DERIVATIVE 2.1 Te Derivative Contemporary Calculus 2.1 THE DEFINITION OF DERIVATIVE 1 Te grapical idea of a slope of a tangent line is very useful, but for some uses we need a more algebraic definition of te derivative

More information

How to Find the Derivative of a Function: Calculus 1

How to Find the Derivative of a Function: Calculus 1 Introduction How to Find te Derivative of a Function: Calculus 1 Calculus is not an easy matematics course Te fact tat you ave enrolled in suc a difficult subject indicates tat you are interested in te

More information

Bounds on the Moments for an Ensemble of Random Decision Trees

Bounds on the Moments for an Ensemble of Random Decision Trees Noname manuscript No. (will be inserted by te editor) Bounds on te Moments for an Ensemble of Random Decision Trees Amit Durandar Received: Sep. 17, 2013 / Revised: Mar. 04, 2014 / Accepted: Jun. 30, 2014

More information

Physically Based Modeling: Principles and Practice Implicit Methods for Differential Equations

Physically Based Modeling: Principles and Practice Implicit Methods for Differential Equations Pysically Based Modeling: Principles and Practice Implicit Metods for Differential Equations David Baraff Robotics Institute Carnegie Mellon University Please note: Tis document is 997 by David Baraff

More information

Optimal parameters for a hierarchical grid data structure for contact detection in arbitrarily polydisperse particle systems

Optimal parameters for a hierarchical grid data structure for contact detection in arbitrarily polydisperse particle systems Comp. Part. Mec. 04) :357 37 DOI 0.007/s4057-04-000-9 Optimal parameters for a ierarcical grid data structure for contact detection in arbitrarily polydisperse particle systems Dinant Krijgsman Vitaliy

More information

to the data. The search procedure tries to identify network structures with high scores. Heckerman

to the data. The search procedure tries to identify network structures with high scores. Heckerman 2 Learning Bayesian Networks is NP-Complete David Maxwell Cickering Computer Science Department University of California at Los Angeles dmax@cs.ucla.edu ABSTRACT Algoritms for learning Bayesian networks

More information

HOMEWORK HELP 2 FOR MATH 151

HOMEWORK HELP 2 FOR MATH 151 HOMEWORK HELP 2 FOR MATH 151 Here we go; te second round of omework elp. If tere are oters you would like to see, let me know! 2.4, 43 and 44 At wat points are te functions f(x) and g(x) = xf(x)continuous,

More information

estimate results from a recursive sceme tat generalizes te algoritms of Efron (967), Turnbull (976) and Li et al (997) by kernel smooting te data at e

estimate results from a recursive sceme tat generalizes te algoritms of Efron (967), Turnbull (976) and Li et al (997) by kernel smooting te data at e A kernel density estimate for interval censored data Tierry Ducesne and James E Staord y Abstract In tis paper we propose a kernel density estimate for interval-censored data It retains te simplicity andintuitive

More information

0.1 Differentiation Rules

0.1 Differentiation Rules 0.1 Differentiation Rules From our previous work we ve seen tat it can be quite a task to calculate te erivative of an arbitrary function. Just working wit a secon-orer polynomial tings get pretty complicate

More information

Math 1210 Midterm 1 January 31st, 2014

Math 1210 Midterm 1 January 31st, 2014 Mat 110 Midterm 1 January 1st, 01 Tis exam consists of sections, A and B. Section A is conceptual, wereas section B is more computational. Te value of every question is indicated at te beginning of it.

More information

1 Proving the Fundamental Theorem of Statistical Learning

1 Proving the Fundamental Theorem of Statistical Learning THEORETICAL MACHINE LEARNING COS 5 LECTURE #7 APRIL 5, 6 LECTURER: ELAD HAZAN NAME: FERMI MA ANDDANIEL SUO oving te Fundaental Teore of Statistical Learning In tis section, we prove te following: Teore.

More information

EFFICIENCY OF MODEL-ASSISTED REGRESSION ESTIMATORS IN SAMPLE SURVEYS

EFFICIENCY OF MODEL-ASSISTED REGRESSION ESTIMATORS IN SAMPLE SURVEYS Statistica Sinica 24 2014, 395-414 doi:ttp://dx.doi.org/10.5705/ss.2012.064 EFFICIENCY OF MODEL-ASSISTED REGRESSION ESTIMATORS IN SAMPLE SURVEYS Jun Sao 1,2 and Seng Wang 3 1 East Cina Normal University,

More information

Investigating Euler s Method and Differential Equations to Approximate π. Lindsay Crowl August 2, 2001

Investigating Euler s Method and Differential Equations to Approximate π. Lindsay Crowl August 2, 2001 Investigating Euler s Metod and Differential Equations to Approximate π Lindsa Crowl August 2, 2001 Tis researc paper focuses on finding a more efficient and accurate wa to approximate π. Suppose tat x

More information

Long Term Time Series Prediction with Multi-Input Multi-Output Local Learning

Long Term Time Series Prediction with Multi-Input Multi-Output Local Learning Long Term Time Series Prediction wit Multi-Input Multi-Output Local Learning Gianluca Bontempi Macine Learning Group, Département d Informatique Faculté des Sciences, ULB, Université Libre de Bruxelles

More information

Lab 6 Derivatives and Mutant Bacteria

Lab 6 Derivatives and Mutant Bacteria Lab 6 Derivatives and Mutant Bacteria Date: September 27, 20 Assignment Due Date: October 4, 20 Goal: In tis lab you will furter explore te concept of a derivative using R. You will use your knowledge

More information

Technology-Independent Design of Neurocomputers: The Universal Field Computer 1

Technology-Independent Design of Neurocomputers: The Universal Field Computer 1 Tecnology-Independent Design of Neurocomputers: Te Universal Field Computer 1 Abstract Bruce J. MacLennan Computer Science Department Naval Postgraduate Scool Monterey, CA 9393 We argue tat AI is moving

More information

Math 161 (33) - Final exam

Math 161 (33) - Final exam Name: Id #: Mat 161 (33) - Final exam Fall Quarter 2015 Wednesday December 9, 2015-10:30am to 12:30am Instructions: Prob. Points Score possible 1 25 2 25 3 25 4 25 TOTAL 75 (BEST 3) Read eac problem carefully.

More information

THE IDEA OF DIFFERENTIABILITY FOR FUNCTIONS OF SEVERAL VARIABLES Math 225

THE IDEA OF DIFFERENTIABILITY FOR FUNCTIONS OF SEVERAL VARIABLES Math 225 THE IDEA OF DIFFERENTIABILITY FOR FUNCTIONS OF SEVERAL VARIABLES Mat 225 As we ave seen, te definition of derivative for a Mat 111 function g : R R and for acurveγ : R E n are te same, except for interpretation:

More information

CS522 - Partial Di erential Equations

CS522 - Partial Di erential Equations CS5 - Partial Di erential Equations Tibor Jánosi April 5, 5 Numerical Di erentiation In principle, di erentiation is a simple operation. Indeed, given a function speci ed as a closed-form formula, its

More information

Domination Problems in Nowhere-Dense Classes of Graphs

Domination Problems in Nowhere-Dense Classes of Graphs LIPIcs Leibniz International Proceedings in Informatics Domination Problems in Nowere-Dense Classes of Graps Anuj Dawar 1, Stepan Kreutzer 2 1 University of Cambridge Computer Lab, U.K. anuj.dawar@cl.cam.ac.uk

More information

Volume 29, Issue 3. Existence of competitive equilibrium in economies with multi-member households

Volume 29, Issue 3. Existence of competitive equilibrium in economies with multi-member households Volume 29, Issue 3 Existence of competitive equilibrium in economies wit multi-member ouseolds Noriisa Sato Graduate Scool of Economics, Waseda University Abstract Tis paper focuses on te existence of

More information

Introduction to Derivatives

Introduction to Derivatives Introduction to Derivatives 5-Minute Review: Instantaneous Rates and Tangent Slope Recall te analogy tat we developed earlier First we saw tat te secant slope of te line troug te two points (a, f (a))

More information

MATH1151 Calculus Test S1 v2a

MATH1151 Calculus Test S1 v2a MATH5 Calculus Test 8 S va January 8, 5 Tese solutions were written and typed up by Brendan Trin Please be etical wit tis resource It is for te use of MatSOC members, so do not repost it on oter forums

More information

Robotic manipulation project

Robotic manipulation project Robotic manipulation project Bin Nguyen December 5, 2006 Abstract Tis is te draft report for Robotic Manipulation s class project. Te cosen project aims to understand and implement Kevin Egan s non-convex

More information

1 Calculus. 1.1 Gradients and the Derivative. Q f(x+h) f(x)

1 Calculus. 1.1 Gradients and the Derivative. Q f(x+h) f(x) Calculus. Gradients and te Derivative Q f(x+) δy P T δx R f(x) 0 x x+ Let P (x, f(x)) and Q(x+, f(x+)) denote two points on te curve of te function y = f(x) and let R denote te point of intersection of

More information

232 Calculus and Structures

232 Calculus and Structures 3 Calculus and Structures CHAPTER 17 JUSTIFICATION OF THE AREA AND SLOPE METHODS FOR EVALUATING BEAMS Calculus and Structures 33 Copyrigt Capter 17 JUSTIFICATION OF THE AREA AND SLOPE METHODS 17.1 THE

More information

ERROR BOUNDS FOR THE METHODS OF GLIMM, GODUNOV AND LEVEQUE BRADLEY J. LUCIER*

ERROR BOUNDS FOR THE METHODS OF GLIMM, GODUNOV AND LEVEQUE BRADLEY J. LUCIER* EO BOUNDS FO THE METHODS OF GLIMM, GODUNOV AND LEVEQUE BADLEY J. LUCIE* Abstract. Te expected error in L ) attimet for Glimm s sceme wen applied to a scalar conservation law is bounded by + 2 ) ) /2 T

More information

Numerical Differentiation

Numerical Differentiation Numerical Differentiation Finite Difference Formulas for te first derivative (Using Taylor Expansion tecnique) (section 8.3.) Suppose tat f() = g() is a function of te variable, and tat as 0 te function

More information

Section 2: The Derivative Definition of the Derivative

Section 2: The Derivative Definition of the Derivative Capter 2 Te Derivative Applied Calculus 80 Section 2: Te Derivative Definition of te Derivative Suppose we drop a tomato from te top of a 00 foot building and time its fall. Time (sec) Heigt (ft) 0.0 00

More information

Solution. Solution. f (x) = (cos x)2 cos(2x) 2 sin(2x) 2 cos x ( sin x) (cos x) 4. f (π/4) = ( 2/2) ( 2/2) ( 2/2) ( 2/2) 4.

Solution. Solution. f (x) = (cos x)2 cos(2x) 2 sin(2x) 2 cos x ( sin x) (cos x) 4. f (π/4) = ( 2/2) ( 2/2) ( 2/2) ( 2/2) 4. December 09, 20 Calculus PracticeTest s Name: (4 points) Find te absolute extrema of f(x) = x 3 0 on te interval [0, 4] Te derivative of f(x) is f (x) = 3x 2, wic is zero only at x = 0 Tus we only need

More information

5 Ordinary Differential Equations: Finite Difference Methods for Boundary Problems

5 Ordinary Differential Equations: Finite Difference Methods for Boundary Problems 5 Ordinary Differential Equations: Finite Difference Metods for Boundary Problems Read sections 10.1, 10.2, 10.4 Review questions 10.1 10.4, 10.8 10.9, 10.13 5.1 Introduction In te previous capters we

More information

Spike train entropy-rate estimation using hierarchical Dirichlet process priors

Spike train entropy-rate estimation using hierarchical Dirichlet process priors publised in: Advances in Neural Information Processing Systems 26 (23), 276 284. Spike train entropy-rate estimation using ierarcical Diriclet process priors Karin Knudson Department of Matematics kknudson@mat.utexas.edu

More information

LIMITS AND DERIVATIVES CONDITIONS FOR THE EXISTENCE OF A LIMIT

LIMITS AND DERIVATIVES CONDITIONS FOR THE EXISTENCE OF A LIMIT LIMITS AND DERIVATIVES Te limit of a function is defined as te value of y tat te curve approaces, as x approaces a particular value. Te limit of f (x) as x approaces a is written as f (x) approaces, as

More information

2.3 Algebraic approach to limits

2.3 Algebraic approach to limits CHAPTER 2. LIMITS 32 2.3 Algebraic approac to its Now we start to learn ow to find its algebraically. Tis starts wit te simplest possible its, and ten builds tese up to more complicated examples. Fact.

More information

Continuity and Differentiability

Continuity and Differentiability Continuity and Dierentiability Tis capter requires a good understanding o its. Te concepts o continuity and dierentiability are more or less obvious etensions o te concept o its. Section - INTRODUCTION

More information

Material for Difference Quotient

Material for Difference Quotient Material for Difference Quotient Prepared by Stepanie Quintal, graduate student and Marvin Stick, professor Dept. of Matematical Sciences, UMass Lowell Summer 05 Preface Te following difference quotient

More information

The cluster problem in constrained global optimization

The cluster problem in constrained global optimization Te cluster problem in constrained global optimization Te MIT Faculty as made tis article openly available. Please sare ow tis access benefits you. Your story matters. Citation As Publised Publiser Kannan,

More information

THE STURM-LIOUVILLE-TRANSFORMATION FOR THE SOLUTION OF VECTOR PARTIAL DIFFERENTIAL EQUATIONS. L. Trautmann, R. Rabenstein

THE STURM-LIOUVILLE-TRANSFORMATION FOR THE SOLUTION OF VECTOR PARTIAL DIFFERENTIAL EQUATIONS. L. Trautmann, R. Rabenstein Worksop on Transforms and Filter Banks (WTFB),Brandenburg, Germany, Marc 999 THE STURM-LIOUVILLE-TRANSFORMATION FOR THE SOLUTION OF VECTOR PARTIAL DIFFERENTIAL EQUATIONS L. Trautmann, R. Rabenstein Lerstul

More information