Sample Efficient Reinforcement Learning with Gaussian Processes

Robert C. Grande, Thomas J. Walsh, Jonathan P. How. Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, MA 02139, USA.

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the authors.

Abstract

This paper derives sample complexity results for using Gaussian Processes (GPs) in both model-based and model-free reinforcement learning (RL). We show that GPs are KWIK learnable, proving for the first time that a model-based RL approach using GPs, GP-Rmax, is sample efficient (PAC-MDP). However, we then show that previous approaches to model-free RL using GPs take an exponential number of steps to find an optimal policy, and are therefore not sample efficient. The third and main contribution is the introduction of a model-free RL algorithm using GPs, DGPQ, which is sample efficient and, in contrast to model-based algorithms, capable of acting in real time, as demonstrated on a five-dimensional aircraft simulator.

1. Introduction

In Reinforcement Learning (RL) (Sutton & Barto, 1998), several new algorithms for efficient exploration in continuous state spaces have been proposed, including GP-Rmax (Jung & Stone, 2010) and C-PACE (Pazis & Parr, 2013). In particular, C-PACE was shown to be PAC-MDP, an important class of RL algorithms that obtain an optimal policy in a polynomial number of exploration steps. However, these approaches require a costly fixed-point computation on each experience, making them ill-suited for real-time control of physical systems, such as aircraft. This paper presents a series of sample complexity (PAC-MDP) results for algorithms that use Gaussian Processes (GPs) (Rasmussen & Williams, 2006) in RL, culminating with the introduction of a PAC-MDP model-free algorithm which does not require this fixed-point computation and is better suited for real-time learning and control.

First, using the KWIK learning framework (Li et al., 2011), we provide the first-ever sample complexity analysis of GP learning under the conditions necessary for RL. The result of this analysis actually proves that the previously described model-based GP-Rmax is indeed PAC-MDP. However, it still cannot be used in real time, as mentioned above. Our second contribution is in model-free RL, in which a GP is used to directly model the Q-function. We show that existing model-free algorithms (Engel et al., 2005; Chung et al., 2013) that use a single GP may require an exponential (in the discount factor) number of samples to reach a near-optimal policy. The third, and primary, contribution of this work is the introduction of the first model-free continuous-state-space PAC-MDP algorithm using GPs: Delayed-GPQ (DGPQ). DGPQ represents the current value function as a GP and updates a separately stored value function only when sufficient outlier data has been detected. This operation overwrites a portion of the stored value function and resets the GP confidence bounds, avoiding the slowed convergence rate of the naïve model-free approach.

The underlying analogy is that while GP-Rmax and C-PACE are generalizations of model-based Rmax (Brafman & Tennenholtz, 2002), DGPQ is a generalization of model-free Delayed Q-learning (DQL) (Strehl et al., 2006; 2009) to general continuous spaces. That is, while early PAC-MDP RL algorithms were model-based and ran a planner after each experience (Li et al., 2011), DQL is a PAC-MDP algorithm that performs at most a single Bellman backup on each step. In DGPQ, the delayed updates of DQL are adapted to continuous state spaces using a GP.
We prove that DGPQ is PAC-MDP, and show that, using a sparse online GP implementation (Csató & Opper, 2002), DGPQ can perform in real time. Our empirical results, including those on an F-16 simulator, show DGPQ is both sample efficient and orders of magnitude faster in per-sample computation than other PAC-MDP continuous-state learners.

2. Background

We now describe background material on Reinforcement Learning (RL) and Gaussian Processes (GPs).

2.1. Reinforcement Learning

We model the RL environment as a Markov Decision Process (Puterman, 1994) M = ⟨S, A, R, T, γ⟩ with a potentially infinite set of states S, finite actions A, and discount factor 0 ≤ γ < 1. Each step elicits a reward R(s, a) ∈ [0, R_max] and a stochastic transition to state s' ∼ T(s, a). Every MDP has an optimal value function
$$Q^*(s,a) = R(s,a) + \gamma \int_{s'} T(s,a,s')\, V^*(s')\, ds', \qquad V^*(s) = \max_a Q^*(s,a),$$
and a corresponding optimal policy π* : S → A. Note that the bounded reward function means V*(s) ∈ [0, V_max].

In RL, an agent is given S, A, and γ and then acts in M with the goal of enacting π*. For value-based methods (as opposed to policy search (Kalyanakrishnan & Stone, 2009)), there are roughly two classes of RL algorithms: model-based and model-free. Model-based algorithms, such as KWIK-Rmax (Li et al., 2011), build models of T and R and then use a planner to find Q*. Many model-based approaches have sample efficiency guarantees, that is, bounds on the amount of exploration they perform. We use the PAC-MDP definition of sample complexity (Strehl et al., 2009). The definition uses the covering number of a set, adapted from Pazis & Parr (2013).

Definition 1. The covering number N_U(r) of a compact domain U ⊂ R^n is the cardinality of the minimal set C = {c_1, ..., c_{N_c}} such that for all x ∈ U there exists c_j ∈ C with d(x, c_j) ≤ r, where d(·,·) is some distance metric.

Definition 2. An algorithm with non-stationary policy A_t and accuracy parameters ε and δ, run in an MDP of size N_SA(r), is said to be Probably Approximately Correct for MDPs (PAC-MDP) if, with probability 1 − δ, it takes no more than a polynomial (in N_SA(r), 1/(1−γ), 1/ε, 1/δ) number of steps where V^{A_t}(s_t) < V*(s_t) − ε.

By contrast, model-free methods such as Q-learning (Watkins, 1992) build Q* directly from experience without explicitly representing T and R. Generally, model-based methods are more sample efficient but require more computation time for the planner. Model-free methods are generally computationally light and can be applied without a planner, but sometimes need exponentially more samples, and are usually not PAC-MDP. There are also methods that are not easily classified in these categories, such as C-PACE (Pazis & Parr, 2013), which does not explicitly model T and R but performs a fixed-point operation for planning.

2.2. Gaussian Processes

This paper uses GPs as function approximators both in model-based RL (where GPs represent T and R) and in model-free RL (where GPs represent Q), while showing how to maintain PAC-MDP sample efficiency in both cases. A GP is defined as a collection of random variables, any finite subset of which has a joint Gaussian distribution (Rasmussen & Williams, 2006) with prior mean µ(x) and covariance kernel k(x, x'). In this paper, we use the radial basis function (RBF) kernel, k(x, x') = exp(−‖x − x'‖²/(2θ²)). The elements of the GP kernel matrix K(X, X) are defined as K_{i,j} = k(x_i, x_j), and k(X, x) denotes the kernel vector. The conditional probability distribution of the GP at a new data point x is then a normal variable (Rasmussen & Williams, 2006) with mean
$$\hat\mu(x) = k(X,x)^T\,[K(X,X) + \omega_n^2 I]^{-1}\, y \qquad(1)$$
and covariance
$$\Sigma(x) = k(x,x) - k(X,x)^T\,[K(X,X) + \omega_n^2 I]^{-1}\, k(X,x), \qquad(2)$$
where y is the set of observations, X is the set of observation input locations, and ω_n² is the measurement noise variance. For computational and memory efficiency, we employ an online sparse GP approximation (Csató & Opper, 2002) in the experiments.
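
The mean and covariance equations (1)-(2) translate directly into code. The following is a minimal, illustrative sketch of GP regression with an RBF kernel, not the implementation used in the paper; the class name SimpleGP and the default hyperparameters are assumptions introduced here and are reused in later sketches.

```python
import numpy as np

def rbf_kernel(A, B, theta=0.05):
    """RBF kernel matrix between inputs A (n x d) and B (m x d)."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dist / (2.0 * theta ** 2))

class SimpleGP:
    """Plain (non-sparse) GP regression with a zero prior mean, following Eqs. (1)-(2)."""
    def __init__(self, theta=0.05, omega_n=0.1):
        self.theta, self.omega_n = theta, omega_n
        self.X, self.y = None, None

    def update(self, x, y):
        x, y = np.atleast_2d(x), np.atleast_1d(y)
        self.X = x if self.X is None else np.vstack([self.X, x])
        self.y = y if self.y is None else np.concatenate([self.y, y])

    def predict(self, x):
        x = np.atleast_2d(x)
        if self.X is None:                       # prior: zero mean, unit kernel variance
            return np.zeros(len(x)), rbf_kernel(x, x, self.theta).diagonal()
        K = rbf_kernel(self.X, self.X, self.theta) + self.omega_n ** 2 * np.eye(len(self.X))
        k_star = rbf_kernel(self.X, x, self.theta)             # kernel vector k(X, x)
        mean = k_star.T @ np.linalg.solve(K, self.y)           # Eq. (1)
        v = np.linalg.solve(K, k_star)
        var = rbf_kernel(x, x, self.theta).diagonal() - np.einsum('ij,ij->j', k_star, v)  # Eq. (2)
        return mean, var
```

The paper's experiments use the sparse online approximation of Csató & Opper (2002) rather than this exact formulation, and a non-zero (optimistic) prior mean where noted later.
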
2.3. Related Work

GPs have been used for both model-free and model-based RL. In model-free RL, GP-Sarsa (Engel et al., 2005) has been used to model a value function and extended to off-policy learning (Chowdhary et al., 2014) and to heuristically better exploration in iGP-Sarsa (Chung et al., 2013). However, we prove in Section 4.1 that these algorithms may require an exponential (in 1/(1−γ)) number of samples to reach optimal behavior, since they only use a single GP to model the value function. In model-based RL, the PILCO algorithm trained GPs to represent T and then derived policies using policy search (Deisenroth & Rasmussen, 2011). However, PILCO does not include a provably efficient (PAC-MDP) exploration strategy. GP-Rmax (Jung & Stone, 2010) does include an exploration strategy, specifically replacing areas of low confidence in the T and R GPs with high-valued states, but no theoretical results are given in that paper. In Section 3, we show that GP-Rmax actually is PAC-MDP, but the algorithm's planning phase after each update (comparable to the C-PACE planning in our experiments) makes it computationally infeasible for real-time control. REKWIRE (Li & Littman, 2010) uses KWIK linear regression for a model-free PAC-MDP algorithm. However, REKWIRE needs H separate approximators for finite horizon H, and assumes Q* can be modeled as a linear function, in contrast to the general continuous functions considered in our work.

The C-PACE algorithm (Pazis & Parr, 2013) has already been shown to be PAC-MDP in the continuous setting, though it does not use a GP representation. C-PACE stores data points that do not have close-enough neighbors to be considered known. When it adds a new data point, the Q-values of each point are calculated by a value-iteration-like operation. The computation for this operation grows with the number of data points, and so, as shown in our experiments, the algorithm may not be able to act in real time.

3. GPs for Model-Based RL

In model-based RL (MBRL), models of T and R are created from data and a planner derives π*. Here we review the KWIK-Rmax MBRL architecture, which is PAC-MDP. We then show that GPs are a KWIK-learnable representation of T and R, thereby proving that GP-Rmax (Jung & Stone, 2010) is PAC-MDP given an exact planner.

3.1. KWIK Learning and Exploration

The Knows What It Knows (KWIK) framework (Li et al., 2011) is a supervised learning framework for measuring the number of times a learner will admit uncertainty. A KWIK learning agent is given a hypothesis class H : X → Y for inputs X and outputs Y, and parameters ε and δ. With probability 1 − δ over a run, when given an adversarially chosen input x_t, the agent must either predict ŷ_t such that |y_t − ŷ_t| ≤ ε, where y_t = E[h*(x_t)] is the true expected output, or admit "I don't know" (denoted ∅). A representation is KWIK learnable if an agent can KWIK-learn any h* ∈ H with only a polynomial (in |H|, 1/ε, 1/δ) number of ∅ predictions. In RL, the KWIK-Rmax algorithm (Li et al., 2011) uses KWIK learners to model T and R and replaces the value of any Q(s, a) where T(s, a) or R(s, a) is ∅ with a value of V_max = R_max/(1−γ). If T and R are KWIK-learnable, then the resulting KWIK-Rmax RL algorithm will be PAC-MDP under Definition 2. We now show that a GP is KWIK learnable.

3.2. KWIK Learning a GP

First, we derive a metric for determining whether a GP's mean prediction at input x is ε-accurate with high probability, based on the variance estimate at x.

Lemma 1. Consider a GP trained on samples y = [y_1, ..., y_t] which are drawn from p(y|x) at input locations X = [x_1, ..., x_t], with E[y|x] = f(x) and y_i ∈ [0, V_m]. If the predictive variance of the GP at x_i ∈ X satisfies
$$\sigma^2(x_i) \le \sigma^2_{tol} = \frac{\omega_n^2\, \epsilon^2}{2 V_m^2 \log(2/\delta)},$$
then the prediction error at x_i is bounded in probability: Pr{|µ̂(x_i) − f(x_i)| ≥ ε} ≤ δ.

Proof sketch. Here we ignore the effect of the GP prior, which is considered in depth in the supplementary material. Using McDiarmid's inequality we have that $\Pr\{|\hat\mu(x) - f(x)| \ge \epsilon\} \le 2\exp\!\big(-2\epsilon^2 / \textstyle\sum_i c_i^2\big)$, where c_i is the maximum amount that y_i can alter the prediction value µ̂(x). Using algebraic manipulation, Bayes' law, and noting that the maximum singular value of $k(X,x)^T[K(X,X) + \omega_n^2 I]^{-1}$ is less than 1, it can be shown that $c_i \le \sigma^2(x_i) V_m / \omega_n^2$. Substituting σ²_tol for σ²(x_i) in McDiarmid's inequality and rearranging yields the bound above. See the supplementary material.

Lemma 1 gives a sufficient condition to ensure that, with high probability, the mean of the GP is within ε of E[h*(x)]. A KWIK agent can now be constructed that predicts ∅ if the variance is greater than σ²_tol and otherwise predicts ŷ_t = µ̂(x), the mean prediction of the GP. Theorem 1 bounds the number of ∅ predictions by such an agent and, with the failure probability allocated over the covering set as δ̄ = δ/N_U(r(σ²_tol)), establishes the KWIK learnability of GPs. For ease of exposition, we define the equivalent distance as follows:

Definition 3. The equivalent distance r(ε_tol) is the maximal distance such that, for all x, c ∈ U, if d(x, c) ≤ r(ε_tol), then the linear independence test satisfies γ(x, c) = k(x, x) − k(x, c)²/k(c, c) ≤ ε_tol.
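
As a concrete illustration of the prediction rule just described, the sketch below wraps the hypothetical SimpleGP class from Section 2.2 in a KWIK-style learner that returns the GP mean only when the predictive variance is at most σ²_tol and otherwise answers "I don't know". The threshold follows the form of Lemma 1; its exact constant, like the class itself, should be read as an assumption rather than the paper's code.

```python
import numpy as np

DONT_KNOW = None  # the KWIK "I don't know" symbol

class KWIKGP:
    """Predict with the GP mean only where the predictive variance is below sigma^2_tol."""
    def __init__(self, gp, eps, delta, v_max):
        # Threshold patterned on Lemma 1: sigma^2_tol = omega_n^2 eps^2 / (2 V_m^2 log(2/delta)).
        self.gp = gp
        self.sigma_tol_sq = (gp.omega_n ** 2 * eps ** 2) / (2.0 * v_max ** 2 * np.log(2.0 / delta))

    def predict(self, x):
        mean, var = self.gp.predict(x)
        return mean[0] if var[0] <= self.sigma_tol_sq else DONT_KNOW

    def observe(self, x, y):
        self.gp.update(x, y)
```

A KWIK-Rmax agent would then replace the value of any state-action whose T or R learner returns DONT_KNOW with the optimistic value V_max = R_max/(1 − γ).
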
Theorem 1. Consider a GP model trained over a compact domain U ⊂ R^n with covering number N_U(r(σ²_tol)). Let the observations y = [y_1, ..., y_n] be drawn from p(y|x), with E[y|x] = f(x), and x drawn adversarially. Then the worst-case bound on the number of samples for which Pr{|µ̂(x) − f(x)| ≥ ε} ≥ δ for some x ∈ U, and the GP is forced to return ∅, is at most
$$m_1 = \frac{4 V_m^2}{\epsilon^2}\, \log\!\left(\frac{2}{\delta}\right) N_U(r(\sigma^2_{tol})). \qquad(3)$$
Furthermore, N_U(r(σ²_tol)) grows polynomially with 1/ε and V_m for the RBF kernel.

Proof sketch. If σ²(x) ≤ σ²_tol for all x ∈ U, then the result follows from Lemma 1. By partitioning the domain into the Voronoi regions around the covering set C, it can be shown using the covariance equation (2) (see the supplementary material) that, given n_V samples in the Voronoi region around c_i ∈ C,
$$\sigma^2(x) \le \frac{n_V\, \sigma^2_{tol} + \omega_n^2}{n_V + \omega_n^2}$$
everywhere in the region. Solving for n_V, we find that n_V on the order of ω_n²/σ²_tol is sufficient to drive σ²(x) down to σ²_tol. Plugging the value of σ²_tol from Lemma 1 into this expression gives the number of points that reduce the variance around c_i below σ²_tol.

Therefore, the total number of points we can sample anywhere in U before reducing the variance below σ²_tol everywhere is equal to the sum of the per-region counts n_V over all N_U regions. The total probability of an incorrect prediction is given by the union bound over an incorrect prediction in any region, Σ_{c_i} δ̄ = N_U δ̄ = δ. See the supplementary material for the proof that N_U grows polynomially with 1/ε and V_m.

Intuitively, the KWIK learnability of GPs can be explained as follows. By knowing the value at a certain point within some tolerance ε, the Lipschitz smoothness assumption means there is a nonempty region around this point where values are known within a larger tolerance. Therefore, given sufficient observations in a neighborhood, a GP is able to generalize its learned values to other nearby points. Lemma 1 relates the error at any of these points to the predictive variance of the GP, so a KWIK agent using a GP can use the variance prediction to choose whether to predict or not. As a function becomes less smooth, the size of these neighborhoods shrinks, the covering number increases, and the number of points required to learn over the entire input space increases.

Theorem 1, combined with the KWIK-Rmax theorem (Theorem 3 of Li et al., 2011), establishes that the previously proposed GP-Rmax algorithm (Jung & Stone, 2010), which learns GP representations of T and R and replaces uncertainty with V_max values, is indeed PAC-MDP. GP-Rmax was empirically validated in many domains in that previous work, but the PAC-MDP property was not formally proven. However, GP-Rmax relies on a planner that can derive Q* from the learned T and R, which may be infeasible in continuous domains, especially for real-time control.

4. GPs for Model-Free RL

Model-free RL algorithms, such as Q-learning (Watkins, 1992), are often used when planning is infeasible due to time constraints or the size of the state space. These algorithms do not store T and R but instead try to model Q* directly through incremental updates of the form
$$Q_{t+1}(s,a) = (1-\alpha)\, Q_t(s,a) + \alpha\,\big(r_t + \gamma V_t(s_{t+1})\big).$$
Unfortunately, most Q-learning variants have provably exponential sample complexity, including optimistic/greedy Q-learning with a linearly decaying learning rate (Even-Dar & Mansour, 2004), due to the incremental updates to Q combined with the decaying α term. However, the more recent Delayed Q-learning (DQL) (Strehl et al., 2006) is PAC-MDP. This algorithm works by initializing Q(s, a) to V_max and then overwriting $Q(s,a) = \frac{1}{m}\sum_{i=1}^{m}\big(r_i + \gamma V(s_i')\big)$ after m samples of that (s, a) pair have been seen. In Section 4.1, we show that using a single GP results in exponential sample complexity, like Q-learning; in Section 4.2, we show that using a similar overwriting approach to DQL achieves sample efficiency.

4.1. Naïve Model-Free Learning using GPs

We now show that a naïve use of a single GP to model Q will result in exponentially slow convergence, similar to greedy Q-learning with a linearly decaying α. Consider the following model-free algorithm, which uses a GP to store values of ˆQ = Q_t: a GP for each action, GP_a, is initialized optimistically using the prior; at each step, an action is chosen greedily and GP_a is updated with the input/output sample ⟨x_t, r_t + γ max_b GP_b(s_{t+1})⟩ (a code sketch appears below). We analyze the worst-case performance through a toy example and show that this approach requires an exponential number of samples to learn Q*. This slow convergence is intrinsic to GPs due to the variance reduction rate and the non-stationarity of the ˆQ estimate (as opposed to T and R, whose sampled values do not depend on the learner's current estimates).
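
The following is a minimal sketch of the naïve learner just described, reusing the hypothetical SimpleGP class from Section 2.2. The environment interface, the action set, and the handling of the optimistic prior (as a constant offset on a zero-mean GP) are assumptions made for illustration, not details from the paper.

```python
import numpy as np

def naive_gp_q_learning(env, actions, gamma=0.9, v_max=10.0, n_steps=1000):
    """One GP per action models Q directly; the GP variance is never reinitialized."""
    gps = {a: SimpleGP(theta=0.05, omega_n=0.1) for a in actions}

    def q_value(s, a):
        mean, _ = gps[a].predict(s)
        return mean[0] + v_max            # optimistic prior: shift the zero-mean GP up to V_max

    s = env.reset()
    for _ in range(n_steps):
        a = max(actions, key=lambda b: q_value(s, b))              # greedy action selection
        r, s_next = env.step(a)                                    # assumed (reward, next state) interface
        target = r + gamma * max(q_value(s_next, b) for b in actions)
        gps[a].update(np.atleast_1d(s), target - v_max)            # store the residual w.r.t. the prior
        s = s_next
    return gps
```

Because each new observation moves the posterior mean by an amount proportional to the current, ever-shrinking variance, the estimate changes more and more slowly even though the bootstrapped targets keep moving; this is exactly the failure mode analyzed next.
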
So, any model-free algorithm using a GP that does not reinitialize the variance will have the same worst-case complexity as in this toy example. Consider an MDP with one state s and one action a that transitions deterministically back to s with reward r = 0, and discount factor γ. The Q-function is initialized optimistically to ˆQ_0(s) = 1 using the GP's prior mean. Consider the naïve GP learning algorithm described above, with kernel k(s, s) = 1 and measurement noise ω², used to predict the Q-function via the GP regression equations. We can analyze the behavior of the algorithm by induction. Consider the first iteration: ˆQ_0 = 1, and the first measurement is γ. In this case, the GP prediction is equivalent to the MAP estimate of a random variable with a Gaussian prior and linear measurements subject to Gaussian noise:
$$\hat Q_1 = \frac{\omega^2}{\sigma_0^2 + \omega^2} + \frac{\sigma_0^2}{\sigma_0^2 + \omega^2}\,\gamma. \qquad(4)$$
Recursively,
$$\hat Q_{i+1} = \frac{\omega^2}{\sigma_i^2 + \omega^2}\,\hat Q_i + \frac{\sigma_i^2}{\sigma_i^2 + \omega^2}\,\gamma\, \hat Q_i,$$
where $\sigma_i^2 = \frac{\omega^2 \sigma_0^2}{\omega^2 + i\,\sigma_0^2}$ is the GP variance at iteration i. Substituting for σ_i² and rearranging yields a recursion for the prediction ˆQ_i at each iteration,
$$\hat Q_{i+1} = \frac{\omega^2 + \sigma_0^2(i + \gamma)}{\omega^2 + \sigma_0^2(i+1)}\,\hat Q_i. \qquad(5)$$
From Even-Dar & Mansour (2004), we have that a series of the form $\hat Q_{i+1} = \frac{i+\gamma}{i+1}\hat Q_i$ will converge to within ε exponentially slowly in terms of 1/ε and 1/(1−γ). However, for each term in our series we have
$$\frac{\omega^2 + \sigma_0^2(i + \gamma)}{\omega^2 + \sigma_0^2(i+1)} \ge \frac{i+\gamma}{i+1}. \qquad(6)$$
The modulus of contraction is always at least as large as the right-hand side, so the series' convergence to within ε (since the true value is 0) is at least as slow as the example in Even-Dar & Mansour (2004).
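
A few lines of code make the contraction in Eq. (5) concrete. This is an illustrative simulation of the recursion only; the prior variance and noise values are arbitrary choices, not settings from the paper.

```python
def iterations_to_reach(eps, gamma, sigma0_sq=1.0, omega_sq=1.0, max_iter=10**7):
    """Iterate Eq. (5): Q_{i+1} = (omega^2 + sigma0^2 (i + gamma)) / (omega^2 + sigma0^2 (i + 1)) * Q_i,
    starting from Q_0 = 1, and count the steps until Q_i <= eps (the true value is 0)."""
    q = 1.0
    for i in range(max_iter):
        if q <= eps:
            return i
        q *= (omega_sq + sigma0_sq * (i + gamma)) / (omega_sq + sigma0_sq * (i + 1.0))
    return max_iter  # iteration cap reached

for gamma in (0.5, 0.9):
    print(gamma, iterations_to_reach(eps=0.5, gamma=gamma))
# The step count grows roughly like (1/eps)^(1/(1-gamma)), i.e. exponentially in 1/(1-gamma).
```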

[Figure 1. Single-state MDP results: DGPQ converges quickly to the optimal policy while the naïve GP implementations oscillate.]

Therefore, the naïve GP implementation's convergence to within ε is also exponentially slow and, in fact, has the same learning speed as Q-learning with a linear learning rate. This is because the variance of a GP decays linearly with the number of observed data points, and the magnitude of GP updates is proportional to this variance. Additionally, by adding a second action with reward r = 1 − γ, a greedy naïve agent would oscillate between the two actions for an exponential (in 1/(1−γ)) number of steps, as shown in Figure 1 (blue line), meaning the algorithm is not PAC-MDP, since the other action is not near optimal. Randomness in the policy with ε-greedy exploration (green line) will not improve the slowness, which is due to the non-stationarity of the Q-function and the decaying update magnitudes of the GP. Many existing model-free GP algorithms, including GP-Sarsa (Engel et al., 2005), iGP-Sarsa (Chung et al., 2013), and non-delayed GPQ (Chowdhary et al., 2014), perform exactly such GP updates without variance reinitialization, and therefore have the same worst-case exponential sample complexity.

4.2. Delayed GPQ for Model-Free RL

We now propose a new algorithm, inspired by Delayed Q-learning, that guarantees polynomial sample efficiency by consistently reinitializing the variance of the GP when many observations differ significantly from the current ˆQ estimate.

Algorithm 1: Delayed GPQ (DGPQ)
1: Input: GP kernel k(·,·), Lipschitz constant L_Q, environment Env, actions A, initial state s_0, discount γ, threshold σ²_tol, ε_1
2: for a ∈ A do
3:   ˆQ_a = ∅
4:   GP_a = GP.init(µ = R_max/(1−γ), k)
5: for each timestep t do
6:   a_t = arg max_a ˆQ_a(s_t) by Eq. (7)
7:   r_t, s_{t+1} = Env.takeAct(a_t)
8:   q_t = r_t + γ max_a ˆQ_a(s_{t+1})
9:   σ²_1 = GP_{a_t}.variance(s_t)
10:  if σ²_1 > σ²_tol then
11:    GP_{a_t}.update(s_t, q_t)
12:  σ²_2 = GP_{a_t}.variance(s_t)
13:  if σ²_1 > σ²_tol ≥ σ²_2 and ˆQ_{a_t}(s_t) − GP_{a_t}.mean(s_t) > 2ε_1 then
14:    ˆQ_{a_t}.update(s_t, GP_{a_t}.mean(s_t) + ε_1)
15:    ∀a ∈ A, GP_a = GP.init(µ = ˆQ_a, k)

Delayed GPQ-Learning (DGPQ, Algorithm 1) maintains two representations of the value function. The first is a set of GPs, GP_a for each action, that is updated after each step but not used for action selection. The second representation of the value function, which stores values from previously converged GPs, is denoted ˆQ(s, a). Intuitively, the algorithm uses ˆQ as a safe version of the value function for choosing actions and retrieving backup values for the GP updates, and only updates ˆQ(s, a) when GP_a converges to a significantly different value at s.

The algorithm chooses actions greedily based on ˆQ (line 6) and updates the corresponding GP_a based on the observed reward and ˆQ at the next state (line 8). Note that, in practice, one creates a prior of ˆQ_a(s_t) (line 15) by updating the GP with points z_t = q_t − ˆQ_a(s_t) (line 11) and adding back ˆQ_a(s_t) in the update on line 14. If the GP has just crossed the convergence threshold at point s and learned a value significantly (2ε_1) lower than the current value of ˆQ (line 13), the representation of ˆQ is partially overwritten with this new value plus a bonus term, using an operation described below. Crucially, the GPs are then reset, with all data erased. This reinitializes the variance of the GPs so that updates will have a large magnitude, avoiding the slowed convergence seen in the previous section.
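
To make the two-representation structure concrete, here is a hedged sketch of a single DGPQ timestep. It reuses the hypothetical SimpleGP class from Section 2.2 and assumes that q_hat[a] exposes value(s) and add(s, v) methods implementing the optimistic Lipschitz representation of Eq. (7) introduced in the next section (a matching sketch appears there); the environment interface and the reset strategy are likewise assumptions, not the authors' code.

```python
import numpy as np

def dgpq_step(s, q_hat, gps, env, actions, gamma, sigma_tol_sq, eps1):
    """One timestep of DGPQ (cf. Algorithm 1). q_hat[a] is the safe optimistic value
    function for action a (Eq. (7)); gps[a] is the per-action GP over residuals."""
    # Line 6: act greedily with respect to q_hat.
    a = max(actions, key=lambda b: q_hat[b].value(s))
    # Lines 7-8: take the action and form the one-step backup target.
    r, s_next = env.step(a)
    q_target = r + gamma * max(q_hat[b].value(s_next) for b in actions)
    # Lines 9-11: only feed the GP while its confidence at s is still low.
    var_before = gps[a].predict(s)[1][0]
    if var_before > sigma_tol_sq:
        # Store the residual against q_hat so the GP prior is effectively q_hat (line 15 note).
        gps[a].update(np.atleast_1d(s), q_target - q_hat[a].value(s))
    var_after = gps[a].predict(s)[1][0]
    gp_mean = gps[a].predict(s)[0][0] + q_hat[a].value(s)
    # Lines 13-15: if the GP just became confident and disagrees by more than 2*eps1, overwrite.
    if var_before > sigma_tol_sq >= var_after and q_hat[a].value(s) - gp_mean > 2 * eps1:
        q_hat[a].add(s, gp_mean + eps1)            # partial overwrite of q_hat (line 14)
        for b in actions:                          # reset all GPs, erasing their data (line 15)
            gps[b] = SimpleGP(theta=gps[b].theta, omega_n=gps[b].omega_n)
    return s_next
```
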
Guidelines for setting the sensitivity of the two tests (σ²_tol and ε_1) are given in Section 5. There are significant parallels between DGPQ and the discrete-state DQL algorithm (Strehl et al., 2009), but the advancements are non-trivial. First, both algorithms maintain two Q-functions: one for choosing actions (ˆQ) and a temporary function that is updated on each step (GP_a), but in DGPQ we have chosen specific representations to handle continuous states and still maintain optimism and sample complexity guarantees. DQL's counting of up to m discrete state visits for each state cannot be used in continuous spaces, so DGPQ instead checks the immediate convergence of GP_a at s_t to determine if it should compare ˆQ(s_t, a) and GP_a(s_t). Lastly, DQL's discrete learning flags are not applicable in continuous spaces, so DGPQ only compares the two functions as the GP variance crosses a threshold, thereby partitioning the continuous MDP into areas that are known (w.r.t. ˆQ) and unknown, a property that will be vital for proving the algorithm's convergence.

One might be tempted to use yet another GP to model ˆQ(s, a). However, it is difficult to guarantee the optimism of ˆQ (ˆQ ≥ Q* − ε) using a GP, which is a crucial property of most sample-efficient algorithms. In particular, if one attempts to update/overwrite the local prediction values of a GP, unless the kernel has finite support (a point's value only influences a finite region), the overwrite will affect the values of all points in the GP and possibly cause a point that was previously correct to fall below Q*(s, a) − ε.

In order to address this issue, we use an alternative function approximator for ˆQ which includes an optimism bonus, as used in C-PACE (Pazis & Parr, 2013). Specifically, ˆQ(s, a) is stored using a set of values that have been updated from the set of GPs, ⟨s_i, a_i, µ̂_i⟩, and a Lipschitz constant L_Q that is used to find an optimistic upper bound on the Q-function for (s, a):
$$\hat Q(s,a) = \min\!\left(\min_{(s_i,a)\in BV}\big[\hat\mu_i + L_Q\, d\big((s,a),(s_i,a)\big)\big],\ V_{max}\right). \qquad(7)$$
We refer to the set of ⟨s_i, a_i⟩ as the set of basis vectors BV. Intuitively, the basis vectors store values from the previously learned GPs. Around these points, Q* cannot be greater than µ̂_i + L_Q d((s,a),(s_i,a)) by continuity. To predict optimistically at points not in BV, we search over BV for the point with the lowest prediction, including the weighted distance bonus. If no point in BV is sufficiently close, then V_max is used instead. This representation is also used in C-PACE (Pazis & Parr, 2013), but here it simply stores the optimistic Q-function; we do not perform a fixed-point operation. Note that the GP still plays a crucial role in Algorithm 1 because its confidence bounds, which are not captured by ˆQ, partition the space into known/unknown areas that are critical for controlling updates to ˆQ.

In order to perform an update (partial overwrite) of ˆQ, we add an element ⟨s_i, a_i, µ̂_i⟩ to the basis vector set. Redundant constraints are eliminated by checking whether the new constraint results in a lower prediction value at other basis vector locations. Thus, the pseudocode for updating ˆQ is as follows: add point ⟨s_i, a_i, µ̂_i⟩ to the basis vector set; if for any j, µ̂_i + L_Q d((s_i,a),(s_j,a)) ≤ µ̂_j, delete ⟨s_j, a_j, µ̂_j⟩ from the set. A sketch of this representation appears below.

Figure 1 shows the advantage of this technique over the naïve GP training discussed earlier. DGPQ learns the optimal action within 50 steps, while the naïve implementation as well as an ε-greedy variant both oscillate between the two actions. This example illustrates that the targeted exploration of DGPQ is of significant benefit compared to untargeted approaches, including the ε-greedy approach of GP-SARSA.
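
The following sketch shows one way to implement the optimistic representation of Eq. (7) and its overwrite rule. The per-action bookkeeping and the Euclidean state distance are assumptions, and the class is written to match the value(s)/add(s, v) interface assumed in the DGPQ sketch of Section 4.2; it is not the authors' implementation.

```python
import numpy as np

class LipschitzQ:
    """Optimistic Q for one action: min over basis vectors of mu_i + L_Q * d(s, s_i), capped at V_max."""
    def __init__(self, l_q, v_max):
        self.l_q, self.v_max = l_q, v_max
        self.basis = []                      # list of (s_i, mu_i) pairs

    def _d(self, s, s_i):
        return np.linalg.norm(np.atleast_1d(s) - np.atleast_1d(s_i))   # assumed distance metric

    def value(self, s):
        # Eq. (7): optimistic upper bound; V_max if no basis vector is close enough.
        bounds = [mu_i + self.l_q * self._d(s, s_i) for s_i, mu_i in self.basis]
        return min(bounds + [self.v_max])

    def add(self, s_new, mu_new):
        # Partial overwrite: insert the new constraint, then drop constraints it makes redundant.
        self.basis = [(s_i, mu_i) for s_i, mu_i in self.basis
                      if mu_new + self.l_q * self._d(s_new, s_i) > mu_i]
        self.basis.append((np.atleast_1d(s_new), mu_new))
```

In the DGPQ sketch, q_hat would hold one such object per action, each starting with an empty basis so that ˆQ is V_max everywhere before any overwrite occurs.
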
5. The Sample Complexity of DGPQ

In order to prove that DGPQ is PAC-MDP, we adopt a proof structure similar to that of DQL (Strehl et al., 2009) and refer throughout to corresponding theorems and lemmas. First, we extend the DQL definition of a known-state MDP M_{K_t}, containing state/actions from ˆQ that have low Bellman residuals.

Definition 4. During timestep t of DGPQ's execution, with ˆQ as specified and ˆV(s) = max_a ˆQ(s, a), the set of known states is given by
$$K_t = \left\{(s,a) : \hat Q(s,a) - \Big(R(s,a) + \gamma \int_{s'} T(s'\,|\,s,a)\, \hat V(s')\, ds'\Big) \le 3\epsilon_1\right\}.$$
M_{K_t} contains the same MDP parameters as M for (s, a) ∈ K_t and values of V_max for (s, a) ∉ K_t.

We now recap, from Theorem 10 of Strehl et al. (2009), three sufficient conditions for proving that an algorithm with greedy policy values V_t is PAC-MDP:
(1) Optimism: ˆV_t(s) ≥ V*(s) − ε for all timesteps t.
(2) Accuracy with respect to M_{K_t}'s values: ˆV_t(s) − V^{π_t}_{M_{K_t}}(s) ≤ ε for all t.
(3) The number of updates to ˆQ and the number of times a state outside M_{K_t} is reached is bounded by a polynomial function of (N_S(r), 1/ε, 1/δ, 1/(1−γ)).

The proof structure is to develop lemmas that prove these three properties. After defining relevant structures and concepts, Lemmas 2 and 3 bound the number of possible changes and possible attempts to change ˆQ. We then define the typical behavior of the algorithm in Definition 6 and show this behavior occurs with high probability in Lemma 4. These conditions help ensure property (2) above. After that, Lemma 5 shows that the function stored in ˆQ is always optimistic, fulfilling property (1). Finally, property (3) is shown by combining Theorem 1 (the number of steps before the GP converges) with the number of updates to ˆQ from Lemma 2, as formalized in Lemma 7.

We now give the definition of an update as well as a successful update to ˆQ, which extends definitions from the original DQL analysis.

Definition 5. An update (or successful update) of state-action pair (s, a) is a timestep t for which a change to ˆQ (an overwrite) occurs such that ˆQ_t(s, a) − ˆQ_{t+1}(s, a) > ε_1. An attempted update of state-action pair (s, a) is a timestep t for which (s, a) is experienced and GP_{a,t}.var(s) > σ²_tol ≥ GP_{a,t+1}.var(s). An attempted update that is not successful (does not change ˆQ) is an unsuccessful update.

We bound the total number of updates to ˆQ by a polynomial term κ based on the concepts defined above.

Lemma 2. The total number of successful updates (overwrites) during any execution of DGPQ with $\epsilon_1 = \frac{\epsilon(1-\gamma)}{3}$ is
$$\kappa = |A|\, N_S\!\left(\frac{\epsilon_1(1-\gamma)}{3L_Q}\right)\left(\frac{3R_{max}}{(1-\gamma)\,\epsilon_1} + 1\right). \qquad(8)$$

Proof. The proof proceeds by showing that the bound on the number of updates in the single-state case from Section 4.2 is $\big(\frac{3R_{max}}{(1-\gamma)\epsilon_1} + 1\big)$, based on driving the full function from V_max to V_min. This quantity is then multiplied by the covering number and the number of actions. See the supplementary material.

The next lemma bounds the number of attempted updates. The proof is similar to the DQL analysis and is shown in the supplementary material.

Lemma 3. The total number of attempted updates (overwrites) during any execution of DGPQ is $|A|\, N_S\!\left(\frac{\epsilon_1(1-\gamma)}{3L_Q}\right)(1 + \kappa)$.

We now define the following event, called A1 to link back to the DQL proof. A1 describes the situation where a state/action that currently has an inaccurate value (high Bellman residual with respect to ˆQ) is observed m times and this causes a successful update.

Definition 6. Define event A1 to be the event that, for all timesteps t, if (s, a) ∉ K_{k_1} and an attempted update of (s, a) occurs during timestep t, the update will be successful, where k_1 < k_2 < ... < k_m = t are the timesteps where GP_{a,k_i}.var(s_{k_i}) > σ²_tol since the last update to ˆQ that affected (s, a).

We can set an upper bound on m, the number of experiences required to make GP_{a_t}.var(s_t) < σ²_tol, based on m_1 in Theorem 1 from earlier. In practice m will be much smaller, but unlike discrete DQL, DGPQ does not need to be given m, since it uses the GP variance estimate to decide if sufficient data has been collected. The following lemma shows that, with high probability, a failed update will not occur under the conditions of event A1.

Lemma 4. By setting
$$\sigma^2_{tol} = \frac{\omega_n^2\, \epsilon_1^2\, (1-\gamma)^4}{9 R_{max}^2\, \log\!\left(\frac{6 |A|\, N_S\big(\frac{\epsilon_1(1-\gamma)}{3L_Q}\big)(1+\kappa)}{\delta}\right)} \qquad(9)$$
we ensure that the probability of event A1 occurring is at least 1 − δ/3.

Proof. The maximum number of attempted updates is $|A| N_S\!\left(\frac{\epsilon_1(1-\gamma)}{3L_Q}\right)(1+\kappa)$ from Lemma 3. Therefore, by setting $\bar\delta = \frac{\delta}{3|A| N_S(\frac{\epsilon_1(1-\gamma)}{3L_Q})(1+\kappa)}$, $\epsilon = \frac{\epsilon_1(1-\gamma)}{3}$, and $V_m = \frac{R_{max}}{1-\gamma}$, we have from Lemma 1 that during each overwrite there is no greater than probability δ̄ of an incorrect update. Applying the union bound over all of the possible attempted updates, we have that the total probability of A1 not occurring is δ/3.

The next lemma shows the optimism of ˆQ for all timesteps with high probability (1 − δ/3). The proof structure is similar to Lemma 3.10 of Pazis & Parr (2013); see the supplementary material.

Lemma 5. During execution of DGPQ, $Q^*(s,a) \le \hat Q_t(s,a) + \frac{\epsilon_1}{1-\gamma}$ holds for all (s, a) with probability at least 1 − δ/3.

The next lemma connects an unsuccessful update to a state/action's presence in the known MDP M_{K_t}.

Lemma 6. If event A1 occurs, then if an unsuccessful update occurs at time t and GP_a.var(s) < σ²_tol at time t+1, then (s, a) ∈ K_{t+1}.

Proof. The proof is by contradiction and shows that an unsuccessful update in this case implies a previously successful update that would have left GP_a.var(s) ≥ σ²_tol. See the supplementary material.

The final lemma bounds the number of encounters with state/actions not in K_t, which is intuitively the number of points that trigger updates to the GP (from Theorem 1) times the number of changes to ˆQ (from Lemma 2). The proof is in the supplementary material.

Lemma 7. Let $\eta = N_S\!\left(\frac{\epsilon_1(1-\gamma)}{3L_Q}\right)$. If event A1 occurs and $\hat Q_t(s,a) \ge Q^*(s,a) - \frac{\epsilon_1}{1-\gamma}$ holds for all t and (s, a), then the number of timesteps ζ where (s_t, a_t) ∉ K_t is at most
$$\zeta = m_2\, |A|\, \eta \left(\frac{3R_{max}}{(1-\gamma)\,\epsilon_1} + 1\right), \qquad(10)$$
where
$$m_2 = \frac{36 R_{max}^2}{(1-\gamma)^4\, \epsilon_1^2}\, \log\!\left(\frac{6|A|\,\eta\,(1+\kappa)}{\delta}\right). \qquad(11)$$

Finally, we state the PAC-MDP result for DGPQ, which is an instantiation of the general PAC-MDP theorem (Theorem 10) of Strehl et al. (2009).
Theorem 2. Given real numbers 0 < ε < 1/(1−γ) and 0 < δ < 1 and a continuous MDP M, there exist inputs σ²_tol (see Equation (9)) and $\epsilon_1 = \frac{\epsilon(1-\gamma)}{3}$ such that DGPQ executed on M will be PAC-MDP by Definition 2, with only
$$O\!\left(\frac{R_{max}\,\zeta}{\epsilon\,(1-\gamma)^2}\,\log\frac{1}{\delta}\,\log\frac{1}{\epsilon(1-\gamma)}\right)$$
timesteps where V^{π_t}(s_t) < V*(s_t) − ε, where ζ is defined in Equation (10).

[Figure 2. Average (10 runs) steps to the goal and computation time for C-PACE and DGPQ on the square domain.]

The proof of the theorem is the same as that of Theorem 16 by Strehl et al. (2009), but with the updated lemmas from above. The three crucial properties hold when A1 occurs, which Lemma 4 guarantees with probability 1 − δ/3. Property (1), optimism, holds from Lemma 5. Property (2), accuracy, holds from Definition 4 and an analysis of the Bellman equation as in Theorem 16 by Strehl et al. (2009). Finally, property (3), the bounded number of updates and escape events, is proven by Lemmas 2 and 7.

6. Empirical Results

Our first experiment is in a 2-dimensional square over [0, 1]², designed to show the computational disparity between C-PACE and DGPQ. The agent starts at [0, 0] with the goal of reaching within a distance of 0.15 of [1, 1]. Movements are 0.1 in the four compass directions with additive uniform noise of ±0.01. We used an L1 distance metric, L_Q = 9, and an RBF kernel with θ = 0.05, ω_n = 0.1 for the GP. Figure 2 shows the number of steps needed per episode (capped at 100) and the computation time per step used by C-PACE and DGPQ. C-PACE reaches the optimal policy in fewer episodes but requires orders of magnitude more computation during steps with fixed-point computations. Such planning times of over 10s are unacceptable in time-sensitive domains, while DGPQ takes only 0.003s per step.

In the second domain, we use DGPQ to stabilize a simulator of the longitudinal dynamics of an unstable F-16 with linearized dynamics, which motivates the need for a real-time computable policy. The five-dimensional state space contains height, angle of attack, pitch angle, pitch rate, and airspeed. Details of the simulator can be found in Stevens & Lewis (2003). We used the reward
$$r = -\frac{|h - h_d|}{100\,\mathrm{ft}} - \frac{|\dot h|}{100\,\mathrm{ft/s}} - |e|,$$
with aircraft height h, desired height h_d, and elevator angle e (in degrees). The control input was discretized as e ∈ {−1, 0, 1} and the elevator was used to control the aircraft. The thrust input to the engine was fixed as the reference thrust command at the unstable equilibrium. The simulation time step was 0.05s and, at each step, the airspeed was perturbed with Gaussian noise N(0, 1) and the angle of attack was perturbed with Gaussian noise N(0, 0.01). An RBF kernel with θ = 0.05, ω_n = 0.1 was used. The initial height was h_d, and if |h − h_d| ≥ 100, then the vertical velocity and angle of attack were set to zero to act as boundaries.

DGPQ learns to stabilize the aircraft using less than 100 episodes for L_Q = 5 and about 1000 episodes for L_Q = 10. The disparity in learning speed is due to the curse of dimensionality: as the Lipschitz constant doubles, N_c increases in each dimension by 2, resulting in a 2⁵ increase in N_c. DGPQ requires on average 0.04s to compute its policy at each step, which is within the 10Hz command frequency required by the simulator. While the maximum computation time for DGPQ was 0.1s, the simulations were run in MATLAB, so further optimization should be possible. Figure 3 shows the average reward of DGPQ in this domain using L_Q ∈ {5, 10}. We also ran C-PACE in this domain, but its computation time reached over 60s per step in the first episode, well beyond the desired 10Hz command rate.

[Figure 3. Average (10 runs) reward per episode on the F-16 domain.]

7. Conclusions

This paper provides sample efficiency results for using GPs in RL.
In Section 3, GPs are proven to be usable in the KWIK-Rmax MBRL architecture, establishing the previously proposed algorithm GP-Rmax as PAC-MDP. In Section 4.1, we prove that existing model-free algorithms using a single GP have exponential sample complexity, connecting to seemingly unrelated negative results on Q-learning learning speeds. Finally, the development of DGPQ provides the first provably sample-efficient model-free (without a planner or fixed-point computation) RL algorithm for general continuous spaces.

Acknowledgments

We thank Girish Chowdhary for helpful discussions, Hassan Kingravi for his GP implementation, and Aerojet-Rocketdyne and ONR #N for funding.

References

Brafman, Ronen I. and Tennenholtz, Moshe. R-MAX - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213-231, 2002.

Chowdhary, Girish, Liu, Miao, Grande, Robert C., Walsh, Thomas J., How, Jonathan P., and Carin, Lawrence. Off-policy reinforcement learning with Gaussian processes. Acta Automatica Sinica, to appear, 2014.

Chung, Jen Jen, Lawrance, Nicholas R. J., and Sukkarieh, Salah. Gaussian processes for informative exploration in reinforcement learning. In Proceedings of the IEEE International Conference on Robotics and Automation, 2013.

Csató, L. and Opper, M. Sparse on-line Gaussian processes. Neural Computation, 14(3):641-668, 2002.

Deisenroth, Marc Peter and Rasmussen, Carl Edward. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the International Conference on Machine Learning (ICML), 2011.

Engel, Y., Mannor, S., and Meir, R. Reinforcement learning with Gaussian processes. In Proceedings of the International Conference on Machine Learning (ICML), 2005.

Even-Dar, Eyal and Mansour, Yishay. Learning rates for Q-learning. Journal of Machine Learning Research, 5:1-25, 2004.

Jung, Tobias and Stone, Peter. Gaussian processes for sample efficient reinforcement learning with RMAX-like exploration. In Proceedings of the European Conference on Machine Learning (ECML), 2010.

Kalyanakrishnan, Shivaram and Stone, Peter. An empirical analysis of value function-based and policy search reinforcement learning. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, 2009.

Li, Lihong and Littman, Michael L. Reducing reinforcement learning to KWIK online regression. Annals of Mathematics and Artificial Intelligence, 58(3-4):217-237, 2010.

Li, Lihong, Littman, Michael L., Walsh, Thomas J., and Strehl, Alexander L. Knows what it knows: a framework for self-aware learning. Machine Learning, 82(3):399-443, 2011.

Pazis, Jason and Parr, Ronald. PAC optimal exploration in continuous space Markov decision processes. In Proceedings of the AAAI Conference on Artificial Intelligence, 2013.

Puterman, Martin L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York, 1994.

Rasmussen, C. and Williams, C. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.

Stevens, Brian L. and Lewis, Frank L. Aircraft Control and Simulation. Wiley-Interscience, 2003.

Strehl, Alexander L., Li, Lihong, Wiewiora, Eric, Langford, John, and Littman, Michael L. PAC model-free reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), pp. 881-888, 2006.

Strehl, Alexander L., Li, Lihong, and Littman, Michael L. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10:2413-2444, 2009.

Sutton, R. and Barto, A. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

Watkins, C. J. Q-learning. Machine Learning, 8(3):279-292, 1992.

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING AN INTRODUCTION Prof. Dr. Ann Nowé Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING WHAT IS IT? What is it? Learning from interaction Learning about, from, and while

More information

Journal of Computer and System Sciences. An analysis of model-based Interval Estimation for Markov Decision Processes

Journal of Computer and System Sciences. An analysis of model-based Interval Estimation for Markov Decision Processes Journal of Computer and System Sciences 74 (2008) 1309 1331 Contents lists available at ScienceDirect Journal of Computer and System Sciences www.elsevier.com/locate/jcss An analysis of model-based Interval

More information

Reinforcement Learning. Machine Learning, Fall 2010

Reinforcement Learning. Machine Learning, Fall 2010 Reinforcement Learning Machine Learning, Fall 2010 1 Administrativia This week: finish RL, most likely start graphical models LA2: due on Thursday LA3: comes out on Thursday TA Office hours: Today 1:30-2:30

More information

Basics of reinforcement learning

Basics of reinforcement learning Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system

More information

Model-Based Reinforcement Learning with Continuous States and Actions

Model-Based Reinforcement Learning with Continuous States and Actions Marc P. Deisenroth, Carl E. Rasmussen, and Jan Peters: Model-Based Reinforcement Learning with Continuous States and Actions in Proceedings of the 16th European Symposium on Artificial Neural Networks

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Introduction to Reinforcement Learning

Introduction to Reinforcement Learning CSCI-699: Advanced Topics in Deep Learning 01/16/2019 Nitin Kamra Spring 2019 Introduction to Reinforcement Learning 1 What is Reinforcement Learning? So far we have seen unsupervised and supervised learning.

More information

Markov Decision Processes and Solving Finite Problems. February 8, 2017

Markov Decision Processes and Solving Finite Problems. February 8, 2017 Markov Decision Processes and Solving Finite Problems February 8, 2017 Overview of Upcoming Lectures Feb 8: Markov decision processes, value iteration, policy iteration Feb 13: Policy gradients Feb 15:

More information

Reinforcement Learning and NLP

Reinforcement Learning and NLP 1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value

More information

Internet Monetization

Internet Monetization Internet Monetization March May, 2013 Discrete time Finite A decision process (MDP) is reward process with decisions. It models an environment in which all states are and time is divided into stages. Definition

More information

Reinforcement Learning II

Reinforcement Learning II Reinforcement Learning II Andrea Bonarini Artificial Intelligence and Robotics Lab Department of Electronics and Information Politecnico di Milano E-mail: bonarini@elet.polimi.it URL:http://www.dei.polimi.it/people/bonarini

More information

Reinforcement Learning Active Learning

Reinforcement Learning Active Learning Reinforcement Learning Active Learning Alan Fern * Based in part on slides by Daniel Weld 1 Active Reinforcement Learning So far, we ve assumed agent has a policy We just learned how good it is Now, suppose

More information

Grundlagen der Künstlichen Intelligenz
