Sample Efficient Reinforcement Learning with Gaussian Processes

Robert C. Grande, Thomas J. Walsh, Jonathan P. How. Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, MA 02139, USA.

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the authors.

Abstract

This paper derives sample complexity results for using Gaussian Processes (GPs) in both model-based and model-free reinforcement learning (RL). We show that GPs are KWIK learnable, proving for the first time that a model-based RL approach using GPs, GP-Rmax, is sample efficient (PAC-MDP). However, we then show that previous approaches to model-free RL using GPs take an exponential number of steps to find an optimal policy, and are therefore not sample efficient. The third and main contribution is the introduction of a model-free RL algorithm using GPs, DGPQ, which is sample efficient and, in contrast to model-based algorithms, capable of acting in real time, as demonstrated on a five-dimensional aircraft simulator.

1. Introduction

In Reinforcement Learning (RL) (Sutton & Barto, 1998), several new algorithms for efficient exploration in continuous state spaces have been proposed, including GP-Rmax (Jung & Stone, 2010) and C-PACE (Pazis & Parr, 2013). In particular, C-PACE was shown to be PAC-MDP, an important class of RL algorithms that obtain an optimal policy in a polynomial number of exploration steps. However, these approaches require a costly fixed-point computation on each experience, making them ill-suited for real-time control of physical systems, such as aircraft. This paper presents a series of sample complexity (PAC-MDP) results for algorithms that use Gaussian Processes (GPs) (Rasmussen & Williams, 2006) in RL, culminating with the introduction of a PAC-MDP model-free algorithm which does not require this fixed-point computation and is better suited for real-time learning and control.

First, using the KWIK learning framework (Li et al., 2011), we provide the first-ever sample complexity analysis of GP learning under the conditions necessary for RL. The result of this analysis actually proves that the previously described model-based GP-Rmax is indeed PAC-MDP. However, it still cannot be used in real time, as mentioned above. Our second contribution is in model-free RL, in which a GP is used to directly model the Q-function. We show that existing model-free algorithms (Engel et al., 2005; Chung et al., 2013) that use a single GP may require an exponential (in the discount factor) number of samples to reach a near-optimal policy. The third, and primary, contribution of this work is the introduction of the first model-free continuous-state-space PAC-MDP algorithm using GPs: Delayed-GPQ (DGPQ). DGPQ represents the current value function as a GP and updates a separately stored value function only when sufficient outlier data has been detected. This operation overwrites a portion of the stored value function and resets the GP confidence bounds, avoiding the slowed convergence rate of the naïve model-free approach.

The underlying analogy is that while GP-Rmax and C-PACE are generalizations of model-based Rmax (Brafman & Tennenholtz, 2002), DGPQ is a generalization of model-free Delayed Q-learning (DQL) (Strehl et al., 2006; 2009) to general continuous spaces. That is, while early PAC-MDP RL algorithms were model-based and ran a planner after each experience (Li et al., 2011), DQL is a PAC-MDP algorithm that performs at most a single Bellman backup on each step. In DGPQ, the delayed updates of DQL are adapted to continuous state spaces using a GP.
We prove that DGPQ is PAC-MDP, and show that, using a sparse online GP implementation (Csató & Opper, 2002), DGPQ can perform in real time. Our empirical results, including those on an F-16 simulator, show DGPQ is both sample efficient and orders of magnitude faster in per-sample computation than other PAC-MDP continuous-state learners.

2. Background

We now describe background material on Reinforcement Learning (RL) and Gaussian Processes (GPs).

2.1. Reinforcement Learning

We model the RL environment as a Markov Decision Process (Puterman, 1994) M = ⟨S, A, R, T, γ⟩ with a potentially infinite set of states S, finite actions A, and discount factor 0 ≤ γ < 1. Each step elicits a reward R(s, a) ∈ [0, R_max] and a stochastic transition to state s' ∼ T(s, a). Every MDP has an optimal value function
$$Q^*(s,a) = R(s,a) + \gamma \int_{s'} T(s,a,s')\, V^*(s')\, ds', \qquad V^*(s) = \max_a Q^*(s,a),$$
and a corresponding optimal policy π* : S → A. Note that the bounded reward function means V*(s) ∈ [0, V_max].

In RL, an agent is given S, A, and γ and then acts in M with the goal of enacting π*. For value-based methods (as opposed to policy search (Kalyanakrishnan & Stone, 2009)), there are roughly two classes of RL algorithms: model-based and model-free. Model-based algorithms, such as KWIK-Rmax (Li et al., 2011), build models of T and R and then use a planner to find Q*. Many model-based approaches have sample efficiency guarantees, that is, bounds on the amount of exploration they perform. We use the PAC-MDP definition of sample complexity (Strehl et al., 2009). The definition uses the covering number of a set, adapted from Pazis & Parr (2013).

Definition 1. The covering number N_U(r) of a compact domain U ⊂ R^n is the cardinality of the minimal set C = {c_1, ..., c_{N_c}} such that for all x ∈ U there exists c_j ∈ C with d(x, c_j) ≤ r, where d(·,·) is some distance metric.

Definition 2. An algorithm with non-stationary policy A_t and accuracy parameters ε and δ, run in an MDP of size N_SA(r), is said to be Probably Approximately Correct for MDPs (PAC-MDP) if, with probability 1 − δ, it takes no more than a polynomial (in N_SA(r), 1/(1−γ), 1/ε, 1/δ) number of steps where V^{A_t}(s_t) < V*(s_t) − ε.

By contrast, model-free methods such as Q-learning (Watkins, 1992) build Q* directly from experience without explicitly representing T and R. Generally, model-based methods are more sample efficient but require more computation time for the planner. Model-free methods are generally computationally light and can be applied without a planner, but sometimes need exponentially more samples, and are usually not PAC-MDP. There are also methods that are not easily classified in these categories, such as C-PACE (Pazis & Parr, 2013), which does not explicitly model T and R but performs a fixed-point operation for planning.

2.2. Gaussian Processes

This paper uses GPs as function approximators both in model-based RL (where GPs represent T and R) and in model-free RL (where GPs represent Q), while showing how to maintain PAC-MDP sample efficiency in both cases. A GP is defined as a collection of random variables, any finite subset of which has a joint Gaussian distribution (Rasmussen & Williams, 2006) with prior mean µ(x) and covariance kernel k(x, x'). In this paper, we use the radial basis function (RBF) kernel, k(x, x') = exp(−‖x − x'‖²/(2θ²)). The elements of the GP kernel matrix K(X, X) are defined as K_{i,j} = k(x_i, x_j), and k(X, x) denotes the kernel vector. The conditional probability distribution of the GP at a new data point x is then a normal variable (Rasmussen & Williams, 2006) with mean
$$\hat\mu(x) = k(X,x)^T\,[K(X,X) + \omega_n^2 I]^{-1}\, y \qquad(1)$$
and covariance
$$\Sigma(x) = k(x,x) - k(X,x)^T\,[K(X,X) + \omega_n^2 I]^{-1}\, k(X,x), \qquad(2)$$
where y is the set of observations, X is the set of observation input locations, and ω_n² is the measurement noise variance. For computational and memory efficiency, we employ an online sparse GP approximation (Csató & Opper, 2002) in the experiments.
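
The mean and covariance equations (1)-(2) translate directly into code. The following is a minimal, illustrative sketch of GP regression with an RBF kernel, not the implementation used in the paper; the class name SimpleGP and the default hyperparameters are assumptions introduced here and are reused in later sketches.

```python
import numpy as np

def rbf_kernel(A, B, theta=0.05):
    """RBF kernel matrix between inputs A (n x d) and B (m x d)."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dist / (2.0 * theta ** 2))

class SimpleGP:
    """Plain (non-sparse) GP regression with a zero prior mean, following Eqs. (1)-(2)."""
    def __init__(self, theta=0.05, omega_n=0.1):
        self.theta, self.omega_n = theta, omega_n
        self.X, self.y = None, None

    def update(self, x, y):
        x, y = np.atleast_2d(x), np.atleast_1d(y)
        self.X = x if self.X is None else np.vstack([self.X, x])
        self.y = y if self.y is None else np.concatenate([self.y, y])

    def predict(self, x):
        x = np.atleast_2d(x)
        if self.X is None:                       # prior: zero mean, unit kernel variance
            return np.zeros(len(x)), rbf_kernel(x, x, self.theta).diagonal()
        K = rbf_kernel(self.X, self.X, self.theta) + self.omega_n ** 2 * np.eye(len(self.X))
        k_star = rbf_kernel(self.X, x, self.theta)             # kernel vector k(X, x)
        mean = k_star.T @ np.linalg.solve(K, self.y)           # Eq. (1)
        v = np.linalg.solve(K, k_star)
        var = rbf_kernel(x, x, self.theta).diagonal() - np.einsum('ij,ij->j', k_star, v)  # Eq. (2)
        return mean, var
```

The paper's experiments use the sparse online approximation of Csató & Opper (2002) rather than this exact formulation, and a non-zero (optimistic) prior mean where noted later.
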
2.3. Related Work

GPs have been used for both model-free and model-based RL. In model-free RL, GP-Sarsa (Engel et al., 2005) has been used to model a value function and extended to off-policy learning (Chowdhary et al., 2014) and to heuristically better exploration in iGP-Sarsa (Chung et al., 2013). However, we prove in Section 4.1 that these algorithms may require an exponential (in 1/(1−γ)) number of samples to reach optimal behavior, since they only use a single GP to model the value function. In model-based RL, the PILCO algorithm trained GPs to represent T and then derived policies using policy search (Deisenroth & Rasmussen, 2011). However, PILCO does not include a provably efficient (PAC-MDP) exploration strategy. GP-Rmax (Jung & Stone, 2010) does include an exploration strategy, specifically replacing areas of low confidence in the T and R GPs with high-valued states, but no theoretical results are given in that paper. In Section 3, we show that GP-Rmax actually is PAC-MDP, but the algorithm's planning phase after each update (comparable to the C-PACE planning in our experiments) makes it computationally infeasible for real-time control. REKWIRE (Li & Littman, 2010) uses KWIK linear regression for a model-free PAC-MDP algorithm. However, REKWIRE needs H separate approximators for finite horizon H, and assumes Q* can be modeled as a linear function, in contrast to the general continuous functions considered in our work.

The C-PACE algorithm (Pazis & Parr, 2013) has already been shown to be PAC-MDP in the continuous setting, though it does not use a GP representation. C-PACE stores data points that do not have close-enough neighbors to be considered known. When it adds a new data point, the Q-values of each point are calculated by a value-iteration-like operation. The computation for this operation grows with the number of data points, and so, as shown in our experiments, the algorithm may not be able to act in real time.

3. GPs for Model-Based RL

In model-based RL (MBRL), models of T and R are created from data and a planner derives π*. Here we review the KWIK-Rmax MBRL architecture, which is PAC-MDP. We then show that GPs are a KWIK-learnable representation of T and R, thereby proving that GP-Rmax (Jung & Stone, 2010) is PAC-MDP given an exact planner.

3.1. KWIK Learning and Exploration

The Knows What It Knows (KWIK) framework (Li et al., 2011) is a supervised learning framework for measuring the number of times a learner will admit uncertainty. A KWIK learning agent is given a hypothesis class H : X → Y for inputs X and outputs Y, and parameters ε and δ. With probability 1 − δ over a run, when given an adversarially chosen input x_t, the agent must either predict ŷ_t such that |y_t − ŷ_t| ≤ ε, where y_t = E[h*(x_t)] is the true expected output, or admit "I don't know" (denoted ∅). A representation is KWIK learnable if an agent can KWIK-learn any h* ∈ H with only a polynomial (in |H|, 1/ε, 1/δ) number of ∅ predictions. In RL, the KWIK-Rmax algorithm (Li et al., 2011) uses KWIK learners to model T and R and replaces the value of any Q(s, a) where T(s, a) or R(s, a) is ∅ with a value of V_max = R_max/(1−γ). If T and R are KWIK-learnable, then the resulting KWIK-Rmax RL algorithm will be PAC-MDP under Definition 2. We now show that a GP is KWIK learnable.

3.2. KWIK Learning a GP

First, we derive a metric for determining whether a GP's mean prediction at input x is ε-accurate with high probability, based on the variance estimate at x.

Lemma 1. Consider a GP trained on samples y = [y_1, ..., y_t] which are drawn from p(y|x) at input locations X = [x_1, ..., x_t], with E[y|x] = f(x) and y_i ∈ [0, V_m]. If the predictive variance of the GP at x_i ∈ X satisfies
$$\sigma^2(x_i) \le \sigma^2_{tol} = \frac{\omega_n^2\, \epsilon^2}{2 V_m^2 \log(2/\delta)},$$
then the prediction error at x_i is bounded in probability: Pr{|µ̂(x_i) − f(x_i)| ≥ ε} ≤ δ.

Proof sketch. Here we ignore the effect of the GP prior, which is considered in depth in the supplementary material. Using McDiarmid's inequality we have that $\Pr\{|\hat\mu(x) - f(x)| \ge \epsilon\} \le 2\exp\!\big(-2\epsilon^2 / \textstyle\sum_i c_i^2\big)$, where c_i is the maximum amount that y_i can alter the prediction value µ̂(x). Using algebraic manipulation, Bayes' law, and noting that the maximum singular value of $k(X,x)^T[K(X,X) + \omega_n^2 I]^{-1}$ is less than 1, it can be shown that $c_i \le \sigma^2(x_i) V_m / \omega_n^2$. Substituting σ²_tol for σ²(x_i) in McDiarmid's inequality and rearranging yields the bound above. See the supplementary material.

Lemma 1 gives a sufficient condition to ensure that, with high probability, the mean of the GP is within ε of E[h*(x)]. A KWIK agent can now be constructed that predicts ∅ if the variance is greater than σ²_tol and otherwise predicts ŷ_t = µ̂(x), the mean prediction of the GP. Theorem 1 bounds the number of ∅ predictions by such an agent and, with the failure probability allocated over the covering set as δ̄ = δ/N_U(r(σ²_tol)), establishes the KWIK learnability of GPs. For ease of exposition, we define the equivalent distance as follows:

Definition 3. The equivalent distance r(ε_tol) is the maximal distance such that, for all x, c ∈ U, if d(x, c) ≤ r(ε_tol), then the linear independence test satisfies γ(x, c) = k(x, x) − k(x, c)²/k(c, c) ≤ ε_tol.
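
As a concrete illustration of the prediction rule just described, the sketch below wraps the hypothetical SimpleGP class from Section 2.2 in a KWIK-style learner that returns the GP mean only when the predictive variance is at most σ²_tol and otherwise answers "I don't know". The threshold follows the form of Lemma 1; its exact constant, like the class itself, should be read as an assumption rather than the paper's code.

```python
import numpy as np

DONT_KNOW = None  # the KWIK "I don't know" symbol

class KWIKGP:
    """Predict with the GP mean only where the predictive variance is below sigma^2_tol."""
    def __init__(self, gp, eps, delta, v_max):
        # Threshold patterned on Lemma 1: sigma^2_tol = omega_n^2 eps^2 / (2 V_m^2 log(2/delta)).
        self.gp = gp
        self.sigma_tol_sq = (gp.omega_n ** 2 * eps ** 2) / (2.0 * v_max ** 2 * np.log(2.0 / delta))

    def predict(self, x):
        mean, var = self.gp.predict(x)
        return mean[0] if var[0] <= self.sigma_tol_sq else DONT_KNOW

    def observe(self, x, y):
        self.gp.update(x, y)
```

A KWIK-Rmax agent would then replace the value of any state-action whose T or R learner returns DONT_KNOW with the optimistic value V_max = R_max/(1 − γ).
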
Theorem 1. Consider a GP model trained over a compact domain U ⊂ R^n with covering number N_U(r(σ²_tol)). Let the observations y = [y_1, ..., y_n] be drawn from p(y|x), with E[y|x] = f(x), and x drawn adversarially. Then the worst-case bound on the number of samples for which Pr{|µ̂(x) − f(x)| ≥ ε} ≥ δ for some x ∈ U, and the GP is forced to return ∅, is at most
$$m_1 = \frac{4 V_m^2}{\epsilon^2}\, \log\!\left(\frac{2}{\delta}\right) N_U(r(\sigma^2_{tol})). \qquad(3)$$
Furthermore, N_U(r(σ²_tol)) grows polynomially with 1/ε and V_m for the RBF kernel.

Proof sketch. If σ²(x) ≤ σ²_tol for all x ∈ U, then the result follows from Lemma 1. By partitioning the domain into the Voronoi regions around the covering set C, it can be shown using the covariance equation (2) (see the supplementary material) that, given n_V samples in the Voronoi region around c_i ∈ C,
$$\sigma^2(x) \le \frac{n_V\, \sigma^2_{tol} + \omega_n^2}{n_V + \omega_n^2}$$
everywhere in the region. Solving for n_V, we find that n_V on the order of ω_n²/σ²_tol is sufficient to drive σ²(x) down to σ²_tol. Plugging the value of σ²_tol from Lemma 1 into this expression gives the number of points that reduce the variance around c_i below σ²_tol.

Therefore, the total number of points we can sample anywhere in U before reducing the variance below σ²_tol everywhere is equal to the sum of the per-region counts n_V over all N_U regions. The total probability of an incorrect prediction is given by the union bound over an incorrect prediction in any region, Σ_{c_i} δ̄ = N_U δ̄ = δ. See the supplementary material for the proof that N_U grows polynomially with 1/ε and V_m.

Intuitively, the KWIK learnability of GPs can be explained as follows. By knowing the value at a certain point within some tolerance ε, the Lipschitz smoothness assumption means there is a nonempty region around this point where values are known within a larger tolerance. Therefore, given sufficient observations in a neighborhood, a GP is able to generalize its learned values to other nearby points. Lemma 1 relates the error at any of these points to the predictive variance of the GP, so a KWIK agent using a GP can use the variance prediction to choose whether to predict or not. As a function becomes less smooth, the size of these neighborhoods shrinks, the covering number increases, and the number of points required to learn over the entire input space increases.

Theorem 1, combined with the KWIK-Rmax theorem (Theorem 3 of Li et al., 2011), establishes that the previously proposed GP-Rmax algorithm (Jung & Stone, 2010), which learns GP representations of T and R and replaces uncertainty with V_max values, is indeed PAC-MDP. GP-Rmax was empirically validated in many domains in that previous work, but the PAC-MDP property was not formally proven. However, GP-Rmax relies on a planner that can derive Q* from the learned T and R, which may be infeasible in continuous domains, especially for real-time control.

4. GPs for Model-Free RL

Model-free RL algorithms, such as Q-learning (Watkins, 1992), are often used when planning is infeasible due to time constraints or the size of the state space. These algorithms do not store T and R but instead try to model Q* directly through incremental updates of the form
$$Q_{t+1}(s,a) = (1-\alpha)\, Q_t(s,a) + \alpha\,\big(r_t + \gamma V_t(s_{t+1})\big).$$
Unfortunately, most Q-learning variants have provably exponential sample complexity, including optimistic/greedy Q-learning with a linearly decaying learning rate (Even-Dar & Mansour, 2004), due to the incremental updates to Q combined with the decaying α term. However, the more recent Delayed Q-learning (DQL) (Strehl et al., 2006) is PAC-MDP. This algorithm works by initializing Q(s, a) to V_max and then overwriting $Q(s,a) = \frac{1}{m}\sum_{i=1}^{m}\big(r_i + \gamma V(s_i')\big)$ after m samples of that (s, a) pair have been seen. In Section 4.1, we show that using a single GP results in exponential sample complexity, like Q-learning; in Section 4.2, we show that using a similar overwriting approach to DQL achieves sample efficiency.

4.1. Naïve Model-Free Learning using GPs

We now show that a naïve use of a single GP to model Q will result in exponentially slow convergence, similar to greedy Q-learning with a linearly decaying α. Consider the following model-free algorithm, which uses a GP to store values of ˆQ = Q_t: a GP for each action, GP_a, is initialized optimistically using the prior; at each step, an action is chosen greedily and GP_a is updated with the input/output sample ⟨x_t, r_t + γ max_b GP_b(s_{t+1})⟩ (a code sketch appears below). We analyze the worst-case performance through a toy example and show that this approach requires an exponential number of samples to learn Q*. This slow convergence is intrinsic to GPs due to the variance reduction rate and the non-stationarity of the ˆQ estimate (as opposed to T and R, whose sampled values do not depend on the learner's current estimates).
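
The following is a minimal sketch of the naïve learner just described, reusing the hypothetical SimpleGP class from Section 2.2. The environment interface, the action set, and the handling of the optimistic prior (as a constant offset on a zero-mean GP) are assumptions made for illustration, not details from the paper.

```python
import numpy as np

def naive_gp_q_learning(env, actions, gamma=0.9, v_max=10.0, n_steps=1000):
    """One GP per action models Q directly; the GP variance is never reinitialized."""
    gps = {a: SimpleGP(theta=0.05, omega_n=0.1) for a in actions}

    def q_value(s, a):
        mean, _ = gps[a].predict(s)
        return mean[0] + v_max            # optimistic prior: shift the zero-mean GP up to V_max

    s = env.reset()
    for _ in range(n_steps):
        a = max(actions, key=lambda b: q_value(s, b))              # greedy action selection
        r, s_next = env.step(a)                                    # assumed (reward, next state) interface
        target = r + gamma * max(q_value(s_next, b) for b in actions)
        gps[a].update(np.atleast_1d(s), target - v_max)            # store the residual w.r.t. the prior
        s = s_next
    return gps
```

Because each new observation moves the posterior mean by an amount proportional to the current, ever-shrinking variance, the estimate changes more and more slowly even though the bootstrapped targets keep moving; this is exactly the failure mode analyzed next.
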
So, any model-free algorithm using a GP that does not reinitialize the variance will have the same worst-case complexity as in this toy example. Consider an MDP with one state s and one action a that transitions deterministically back to s with reward r = 0, and discount factor γ. The Q-function is initialized optimistically to ˆQ_0(s) = 1 using the GP's prior mean. Consider the naïve GP learning algorithm described above, with kernel k(s, s) = 1 and measurement noise ω², used to predict the Q-function via the GP regression equations. We can analyze the behavior of the algorithm by induction. Consider the first iteration: ˆQ_0 = 1, and the first measurement is γ. In this case, the GP prediction is equivalent to the MAP estimate of a random variable with a Gaussian prior and linear measurements subject to Gaussian noise:
$$\hat Q_1 = \frac{\omega^2}{\sigma_0^2 + \omega^2} + \frac{\sigma_0^2}{\sigma_0^2 + \omega^2}\,\gamma. \qquad(4)$$
Recursively,
$$\hat Q_{i+1} = \frac{\omega^2}{\sigma_i^2 + \omega^2}\,\hat Q_i + \frac{\sigma_i^2}{\sigma_i^2 + \omega^2}\,\gamma\, \hat Q_i,$$
where $\sigma_i^2 = \frac{\omega^2 \sigma_0^2}{\omega^2 + i\,\sigma_0^2}$ is the GP variance at iteration i. Substituting for σ_i² and rearranging yields a recursion for the prediction ˆQ_i at each iteration,
$$\hat Q_{i+1} = \frac{\omega^2 + \sigma_0^2(i + \gamma)}{\omega^2 + \sigma_0^2(i+1)}\,\hat Q_i. \qquad(5)$$
From Even-Dar & Mansour (2004), we have that a series of the form $\hat Q_{i+1} = \frac{i+\gamma}{i+1}\hat Q_i$ will converge to within ε exponentially slowly in terms of 1/ε and 1/(1−γ). However, for each term in our series we have
$$\frac{\omega^2 + \sigma_0^2(i + \gamma)}{\omega^2 + \sigma_0^2(i+1)} \ge \frac{i+\gamma}{i+1}. \qquad(6)$$
The modulus of contraction is always at least as large as the right-hand side, so the series' convergence to within ε (since the true value is 0) is at least as slow as the example in Even-Dar & Mansour (2004).
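
A few lines of code make the contraction in Eq. (5) concrete. This is an illustrative simulation of the recursion only; the prior variance and noise values are arbitrary choices, not settings from the paper.

```python
def iterations_to_reach(eps, gamma, sigma0_sq=1.0, omega_sq=1.0, max_iter=10**7):
    """Iterate Eq. (5): Q_{i+1} = (omega^2 + sigma0^2 (i + gamma)) / (omega^2 + sigma0^2 (i + 1)) * Q_i,
    starting from Q_0 = 1, and count the steps until Q_i <= eps (the true value is 0)."""
    q = 1.0
    for i in range(max_iter):
        if q <= eps:
            return i
        q *= (omega_sq + sigma0_sq * (i + gamma)) / (omega_sq + sigma0_sq * (i + 1.0))
    return max_iter  # iteration cap reached

for gamma in (0.5, 0.9):
    print(gamma, iterations_to_reach(eps=0.5, gamma=gamma))
# The step count grows roughly like (1/eps)^(1/(1-gamma)), i.e. exponentially in 1/(1-gamma).
```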

[Figure 1. Single-state MDP results: DGPQ converges quickly to the optimal policy while the naïve GP implementations oscillate.]

Therefore, the naïve GP implementation's convergence to within ε is also exponentially slow and, in fact, has the same learning speed as Q-learning with a linear learning rate. This is because the variance of a GP decays linearly with the number of observed data points, and the magnitude of GP updates is proportional to this variance. Additionally, by adding a second action with reward r = 1 − γ, a greedy naïve agent would oscillate between the two actions for an exponential (in 1/(1−γ)) number of steps, as shown in Figure 1 (blue line), meaning the algorithm is not PAC-MDP, since the other action is not near optimal. Randomness in the policy with ε-greedy exploration (green line) will not improve the slowness, which is due to the non-stationarity of the Q-function and the decaying update magnitudes of the GP. Many existing model-free GP algorithms, including GP-Sarsa (Engel et al., 2005), iGP-Sarsa (Chung et al., 2013), and non-delayed GPQ (Chowdhary et al., 2014), perform exactly such GP updates without variance reinitialization, and therefore have the same worst-case exponential sample complexity.

4.2. Delayed GPQ for Model-Free RL

We now propose a new algorithm, inspired by Delayed Q-learning, that guarantees polynomial sample efficiency by consistently reinitializing the variance of the GP when many observations differ significantly from the current ˆQ estimate.

Algorithm 1: Delayed GPQ (DGPQ)
1: Input: GP kernel k(·,·), Lipschitz constant L_Q, environment Env, actions A, initial state s_0, discount γ, threshold σ²_tol, ε_1
2: for a ∈ A do
3:   ˆQ_a = ∅
4:   GP_a = GP.init(µ = R_max/(1−γ), k)
5: for each timestep t do
6:   a_t = arg max_a ˆQ_a(s_t) by Eq. (7)
7:   r_t, s_{t+1} = Env.takeAct(a_t)
8:   q_t = r_t + γ max_a ˆQ_a(s_{t+1})
9:   σ²_1 = GP_{a_t}.variance(s_t)
10:  if σ²_1 > σ²_tol then
11:    GP_{a_t}.update(s_t, q_t)
12:  σ²_2 = GP_{a_t}.variance(s_t)
13:  if σ²_1 > σ²_tol ≥ σ²_2 and ˆQ_{a_t}(s_t) − GP_{a_t}.mean(s_t) > 2ε_1 then
14:    ˆQ_{a_t}.update(s_t, GP_{a_t}.mean(s_t) + ε_1)
15:    ∀a ∈ A, GP_a = GP.init(µ = ˆQ_a, k)

Delayed GPQ-Learning (DGPQ, Algorithm 1) maintains two representations of the value function. The first is a set of GPs, GP_a for each action, that is updated after each step but not used for action selection. The second representation of the value function, which stores values from previously converged GPs, is denoted ˆQ(s, a). Intuitively, the algorithm uses ˆQ as a safe version of the value function for choosing actions and retrieving backup values for the GP updates, and only updates ˆQ(s, a) when GP_a converges to a significantly different value at s.

The algorithm chooses actions greedily based on ˆQ (line 6) and updates the corresponding GP_a based on the observed reward and ˆQ at the next state (line 8). Note that, in practice, one creates a prior of ˆQ_a(s_t) (line 15) by updating the GP with points z_t = q_t − ˆQ_a(s_t) (line 11) and adding back ˆQ_a(s_t) in the update on line 14. If the GP has just crossed the convergence threshold at point s and learned a value significantly (2ε_1) lower than the current value of ˆQ (line 13), the representation of ˆQ is partially overwritten with this new value plus a bonus term, using an operation described below. Crucially, the GPs are then reset, with all data erased. This reinitializes the variance of the GPs so that updates will have a large magnitude, avoiding the slowed convergence seen in the previous section.
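
To make the two-representation structure concrete, here is a hedged sketch of a single DGPQ timestep. It reuses the hypothetical SimpleGP class from Section 2.2 and assumes that q_hat[a] exposes value(s) and add(s, v) methods implementing the optimistic Lipschitz representation of Eq. (7) introduced in the next section (a matching sketch appears there); the environment interface and the reset strategy are likewise assumptions, not the authors' code.

```python
import numpy as np

def dgpq_step(s, q_hat, gps, env, actions, gamma, sigma_tol_sq, eps1):
    """One timestep of DGPQ (cf. Algorithm 1). q_hat[a] is the safe optimistic value
    function for action a (Eq. (7)); gps[a] is the per-action GP over residuals."""
    # Line 6: act greedily with respect to q_hat.
    a = max(actions, key=lambda b: q_hat[b].value(s))
    # Lines 7-8: take the action and form the one-step backup target.
    r, s_next = env.step(a)
    q_target = r + gamma * max(q_hat[b].value(s_next) for b in actions)
    # Lines 9-11: only feed the GP while its confidence at s is still low.
    var_before = gps[a].predict(s)[1][0]
    if var_before > sigma_tol_sq:
        # Store the residual against q_hat so the GP prior is effectively q_hat (line 15 note).
        gps[a].update(np.atleast_1d(s), q_target - q_hat[a].value(s))
    var_after = gps[a].predict(s)[1][0]
    gp_mean = gps[a].predict(s)[0][0] + q_hat[a].value(s)
    # Lines 13-15: if the GP just became confident and disagrees by more than 2*eps1, overwrite.
    if var_before > sigma_tol_sq >= var_after and q_hat[a].value(s) - gp_mean > 2 * eps1:
        q_hat[a].add(s, gp_mean + eps1)            # partial overwrite of q_hat (line 14)
        for b in actions:                          # reset all GPs, erasing their data (line 15)
            gps[b] = SimpleGP(theta=gps[b].theta, omega_n=gps[b].omega_n)
    return s_next
```
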
Guidelines for setting the sensitivity of the two tests (σ²_tol and ε_1) are given in Section 5. There are significant parallels between DGPQ and the discrete-state DQL algorithm (Strehl et al., 2009), but the advancements are non-trivial. First, both algorithms maintain two Q-functions: one for choosing actions (ˆQ) and a temporary function that is updated on each step (GP_a), but in DGPQ we have chosen specific representations to handle continuous states and still maintain optimism and sample complexity guarantees. DQL's counting of up to m discrete state visits for each state cannot be used in continuous spaces, so DGPQ instead checks the immediate convergence of GP_a at s_t to determine if it should compare ˆQ(s_t, a) and GP_a(s_t). Lastly, DQL's discrete learning flags are not applicable in continuous spaces, so DGPQ only compares the two functions as the GP variance crosses a threshold, thereby partitioning the continuous MDP into areas that are known (w.r.t. ˆQ) and unknown, a property that will be vital for proving the algorithm's convergence.

One might be tempted to use yet another GP to model ˆQ(s, a). However, it is difficult to guarantee the optimism of ˆQ (ˆQ ≥ Q* − ε) using a GP, which is a crucial property of most sample-efficient algorithms. In particular, if one attempts to update/overwrite the local prediction values of a GP, unless the kernel has finite support (a point's value only influences a finite region), the overwrite will affect the values of all points in the GP and possibly cause a point that was previously correct to fall below Q*(s, a) − ε.

In order to address this issue, we use an alternative function approximator for ˆQ which includes an optimism bonus, as used in C-PACE (Pazis & Parr, 2013). Specifically, ˆQ(s, a) is stored using a set of values that have been updated from the set of GPs, ⟨s_i, a_i, µ̂_i⟩, and a Lipschitz constant L_Q that is used to find an optimistic upper bound on the Q-function for (s, a):
$$\hat Q(s,a) = \min\!\left(\min_{(s_i,a)\in BV}\big[\hat\mu_i + L_Q\, d\big((s,a),(s_i,a)\big)\big],\ V_{max}\right). \qquad(7)$$
We refer to the set of ⟨s_i, a_i⟩ as the set of basis vectors BV. Intuitively, the basis vectors store values from the previously learned GPs. Around these points, Q* cannot be greater than µ̂_i + L_Q d((s,a),(s_i,a)) by continuity. To predict optimistically at points not in BV, we search over BV for the point with the lowest prediction, including the weighted distance bonus. If no point in BV is sufficiently close, then V_max is used instead. This representation is also used in C-PACE (Pazis & Parr, 2013), but here it simply stores the optimistic Q-function; we do not perform a fixed-point operation. Note that the GP still plays a crucial role in Algorithm 1 because its confidence bounds, which are not captured by ˆQ, partition the space into known/unknown areas that are critical for controlling updates to ˆQ.

In order to perform an update (partial overwrite) of ˆQ, we add an element ⟨s_i, a_i, µ̂_i⟩ to the basis vector set. Redundant constraints are eliminated by checking whether the new constraint results in a lower prediction value at other basis vector locations. Thus, the pseudocode for updating ˆQ is as follows: add point ⟨s_i, a_i, µ̂_i⟩ to the basis vector set; if for any j, µ̂_i + L_Q d((s_i,a),(s_j,a)) ≤ µ̂_j, delete ⟨s_j, a_j, µ̂_j⟩ from the set. A sketch of this representation appears below.

Figure 1 shows the advantage of this technique over the naïve GP training discussed earlier. DGPQ learns the optimal action within 50 steps, while the naïve implementation as well as an ε-greedy variant both oscillate between the two actions. This example illustrates that the targeted exploration of DGPQ is of significant benefit compared to untargeted approaches, including the ε-greedy approach of GP-SARSA.
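
The following sketch shows one way to implement the optimistic representation of Eq. (7) and its overwrite rule. The per-action bookkeeping and the Euclidean state distance are assumptions, and the class is written to match the value(s)/add(s, v) interface assumed in the DGPQ sketch of Section 4.2; it is not the authors' implementation.

```python
import numpy as np

class LipschitzQ:
    """Optimistic Q for one action: min over basis vectors of mu_i + L_Q * d(s, s_i), capped at V_max."""
    def __init__(self, l_q, v_max):
        self.l_q, self.v_max = l_q, v_max
        self.basis = []                      # list of (s_i, mu_i) pairs

    def _d(self, s, s_i):
        return np.linalg.norm(np.atleast_1d(s) - np.atleast_1d(s_i))   # assumed distance metric

    def value(self, s):
        # Eq. (7): optimistic upper bound; V_max if no basis vector is close enough.
        bounds = [mu_i + self.l_q * self._d(s, s_i) for s_i, mu_i in self.basis]
        return min(bounds + [self.v_max])

    def add(self, s_new, mu_new):
        # Partial overwrite: insert the new constraint, then drop constraints it makes redundant.
        self.basis = [(s_i, mu_i) for s_i, mu_i in self.basis
                      if mu_new + self.l_q * self._d(s_new, s_i) > mu_i]
        self.basis.append((np.atleast_1d(s_new), mu_new))
```

In the DGPQ sketch, q_hat would hold one such object per action, each starting with an empty basis so that ˆQ is V_max everywhere before any overwrite occurs.
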
5. The Sample Complexity of DGPQ

In order to prove that DGPQ is PAC-MDP, we adopt a proof structure similar to that of DQL (Strehl et al., 2009) and refer throughout to corresponding theorems and lemmas. First, we extend the DQL definition of a known-state MDP M_{K_t}, containing state/actions from ˆQ that have low Bellman residuals.

Definition 4. During timestep t of DGPQ's execution, with ˆQ as specified and ˆV(s) = max_a ˆQ(s, a), the set of known states is given by
$$K_t = \left\{(s,a) : \hat Q(s,a) - \Big(R(s,a) + \gamma \int_{s'} T(s'\,|\,s,a)\, \hat V(s')\, ds'\Big) \le 3\epsilon_1\right\}.$$
M_{K_t} contains the same MDP parameters as M for (s, a) ∈ K_t and values of V_max for (s, a) ∉ K_t.

We now recap, from Theorem 10 of Strehl et al. (2009), three sufficient conditions for proving that an algorithm with greedy policy values V_t is PAC-MDP:
(1) Optimism: ˆV_t(s) ≥ V*(s) − ε for all timesteps t.
(2) Accuracy with respect to M_{K_t}'s values: ˆV_t(s) − V^{π_t}_{M_{K_t}}(s) ≤ ε for all t.
(3) The number of updates to ˆQ and the number of times a state outside M_{K_t} is reached is bounded by a polynomial function of (N_S(r), 1/ε, 1/δ, 1/(1−γ)).

The proof structure is to develop lemmas that prove these three properties. After defining relevant structures and concepts, Lemmas 2 and 3 bound the number of possible changes and possible attempts to change ˆQ. We then define the typical behavior of the algorithm in Definition 6 and show this behavior occurs with high probability in Lemma 4. These conditions help ensure property (2) above. After that, Lemma 5 shows that the function stored in ˆQ is always optimistic, fulfilling property (1). Finally, property (3) is shown by combining Theorem 1 (the number of steps before the GP converges) with the number of updates to ˆQ from Lemma 2, as formalized in Lemma 7.

We now give the definition of an update as well as a successful update to ˆQ, which extends definitions from the original DQL analysis.

Definition 5. An update (or successful update) of state-action pair (s, a) is a timestep t for which a change to ˆQ (an overwrite) occurs such that ˆQ_t(s, a) − ˆQ_{t+1}(s, a) > ε_1. An attempted update of state-action pair (s, a) is a timestep t for which (s, a) is experienced and GP_{a,t}.var(s) > σ²_tol ≥ GP_{a,t+1}.var(s). An attempted update that is not successful (does not change ˆQ) is an unsuccessful update.

We bound the total number of updates to ˆQ by a polynomial term κ based on the concepts defined above.

Lemma 2. The total number of successful updates (overwrites) during any execution of DGPQ with $\epsilon_1 = \frac{\epsilon(1-\gamma)}{3}$ is
$$\kappa = |A|\, N_S\!\left(\frac{\epsilon_1(1-\gamma)}{3L_Q}\right)\left(\frac{3R_{max}}{(1-\gamma)\,\epsilon_1} + 1\right). \qquad(8)$$

Proof. The proof proceeds by showing that the bound on the number of updates in the single-state case from Section 4.2 is $\big(\frac{3R_{max}}{(1-\gamma)\epsilon_1} + 1\big)$, based on driving the full function from V_max to V_min. This quantity is then multiplied by the covering number and the number of actions. See the supplementary material.

The next lemma bounds the number of attempted updates. The proof is similar to the DQL analysis and is shown in the supplementary material.

Lemma 3. The total number of attempted updates (overwrites) during any execution of DGPQ is $|A|\, N_S\!\left(\frac{\epsilon_1(1-\gamma)}{3L_Q}\right)(1 + \kappa)$.

We now define the following event, called A1 to link back to the DQL proof. A1 describes the situation where a state/action that currently has an inaccurate value (high Bellman residual with respect to ˆQ) is observed m times and this causes a successful update.

Definition 6. Define event A1 to be the event that, for all timesteps t, if (s, a) ∉ K_{k_1} and an attempted update of (s, a) occurs during timestep t, the update will be successful, where k_1 < k_2 < ... < k_m = t are the timesteps where GP_{a,k_i}.var(s_{k_i}) > σ²_tol since the last update to ˆQ that affected (s, a).

We can set an upper bound on m, the number of experiences required to make GP_{a_t}.var(s_t) < σ²_tol, based on m_1 in Theorem 1 from earlier. In practice m will be much smaller, but unlike discrete DQL, DGPQ does not need to be given m, since it uses the GP variance estimate to decide if sufficient data has been collected. The following lemma shows that, with high probability, a failed update will not occur under the conditions of event A1.

Lemma 4. By setting
$$\sigma^2_{tol} = \frac{\omega_n^2\, \epsilon_1^2\, (1-\gamma)^4}{9 R_{max}^2\, \log\!\left(\frac{6 |A|\, N_S\big(\frac{\epsilon_1(1-\gamma)}{3L_Q}\big)(1+\kappa)}{\delta}\right)} \qquad(9)$$
we ensure that the probability of event A1 occurring is at least 1 − δ/3.

Proof. The maximum number of attempted updates is $|A| N_S\!\left(\frac{\epsilon_1(1-\gamma)}{3L_Q}\right)(1+\kappa)$ from Lemma 3. Therefore, by setting $\bar\delta = \frac{\delta}{3|A| N_S(\frac{\epsilon_1(1-\gamma)}{3L_Q})(1+\kappa)}$, $\epsilon = \frac{\epsilon_1(1-\gamma)}{3}$, and $V_m = \frac{R_{max}}{1-\gamma}$, we have from Lemma 1 that during each overwrite there is no greater than probability δ̄ of an incorrect update. Applying the union bound over all of the possible attempted updates, we have that the total probability of A1 not occurring is δ/3.

The next lemma shows the optimism of ˆQ for all timesteps with high probability (1 − δ/3). The proof structure is similar to Lemma 3.10 of Pazis & Parr (2013); see the supplementary material.

Lemma 5. During execution of DGPQ, $Q^*(s,a) \le \hat Q_t(s,a) + \frac{\epsilon_1}{1-\gamma}$ holds for all (s, a) with probability at least 1 − δ/3.

The next lemma connects an unsuccessful update to a state/action's presence in the known MDP M_{K_t}.

Lemma 6. If event A1 occurs, then if an unsuccessful update occurs at time t and GP_a.var(s) < σ²_tol at time t+1, then (s, a) ∈ K_{t+1}.

Proof. The proof is by contradiction and shows that an unsuccessful update in this case implies a previously successful update that would have left GP_a.var(s) ≥ σ²_tol. See the supplementary material.

The final lemma bounds the number of encounters with state/actions not in K_t, which is intuitively the number of points that trigger updates to the GP (from Theorem 1) times the number of changes to ˆQ (from Lemma 2). The proof is in the supplementary material.

Lemma 7. Let $\eta = N_S\!\left(\frac{\epsilon_1(1-\gamma)}{3L_Q}\right)$. If event A1 occurs and $\hat Q_t(s,a) \ge Q^*(s,a) - \frac{\epsilon_1}{1-\gamma}$ holds for all t and (s, a), then the number of timesteps ζ where (s_t, a_t) ∉ K_t is at most
$$\zeta = m_2\, |A|\, \eta \left(\frac{3R_{max}}{(1-\gamma)\,\epsilon_1} + 1\right), \qquad(10)$$
where
$$m_2 = \frac{36 R_{max}^2}{(1-\gamma)^4\, \epsilon_1^2}\, \log\!\left(\frac{6|A|\,\eta\,(1+\kappa)}{\delta}\right). \qquad(11)$$

Finally, we state the PAC-MDP result for DGPQ, which is an instantiation of the general PAC-MDP theorem (Theorem 10) of Strehl et al. (2009).
Theorem 2. Given real numbers 0 < ε < 1/(1−γ) and 0 < δ < 1 and a continuous MDP M, there exist inputs σ²_tol (see Equation (9)) and $\epsilon_1 = \frac{\epsilon(1-\gamma)}{3}$ such that DGPQ executed on M will be PAC-MDP by Definition 2, with only
$$O\!\left(\frac{R_{max}\,\zeta}{\epsilon\,(1-\gamma)^2}\,\log\frac{1}{\delta}\,\log\frac{1}{\epsilon(1-\gamma)}\right)$$
timesteps where V^{π_t}(s_t) < V*(s_t) − ε, where ζ is defined in Equation (10).

[Figure 2. Average (10 runs) steps to the goal and computation time for C-PACE and DGPQ on the square domain.]

The proof of the theorem is the same as that of Theorem 16 by Strehl et al. (2009), but with the updated lemmas from above. The three crucial properties hold when A1 occurs, which Lemma 4 guarantees with probability 1 − δ/3. Property (1), optimism, holds from Lemma 5. Property (2), accuracy, holds from Definition 4 and an analysis of the Bellman equation as in Theorem 16 by Strehl et al. (2009). Finally, property (3), the bounded number of updates and escape events, is proven by Lemmas 2 and 7.

6. Empirical Results

Our first experiment is in a 2-dimensional square over [0, 1]², designed to show the computational disparity between C-PACE and DGPQ. The agent starts at [0, 0] with the goal of reaching within a distance of 0.15 of [1, 1]. Movements are 0.1 in the four compass directions with additive uniform noise of ±0.01. We used an L1 distance metric, L_Q = 9, and an RBF kernel with θ = 0.05, ω_n = 0.1 for the GP. Figure 2 shows the number of steps needed per episode (capped at 100) and the computation time per step used by C-PACE and DGPQ. C-PACE reaches the optimal policy in fewer episodes but requires orders of magnitude more computation during steps with fixed-point computations. Such planning times of over 10s are unacceptable in time-sensitive domains, while DGPQ takes only 0.003s per step.

In the second domain, we use DGPQ to stabilize a simulator of the longitudinal dynamics of an unstable F-16 with linearized dynamics, which motivates the need for a real-time computable policy. The five-dimensional state space contains height, angle of attack, pitch angle, pitch rate, and airspeed. Details of the simulator can be found in Stevens & Lewis (2003). We used the reward
$$r = -\frac{|h - h_d|}{100\,\mathrm{ft}} - \frac{|\dot h|}{100\,\mathrm{ft/s}} - |e|,$$
with aircraft height h, desired height h_d, and elevator angle e (in degrees). The control input was discretized as e ∈ {−1, 0, 1} and the elevator was used to control the aircraft. The thrust input to the engine was fixed as the reference thrust command at the unstable equilibrium. The simulation time step was 0.05s and, at each step, the airspeed was perturbed with Gaussian noise N(0, 1) and the angle of attack was perturbed with Gaussian noise N(0, 0.01). An RBF kernel with θ = 0.05, ω_n = 0.1 was used. The initial height was h_d, and if |h − h_d| ≥ 100, then the vertical velocity and angle of attack were set to zero to act as boundaries.

DGPQ learns to stabilize the aircraft using less than 100 episodes for L_Q = 5 and about 1000 episodes for L_Q = 10. The disparity in learning speed is due to the curse of dimensionality: as the Lipschitz constant doubles, N_c increases in each dimension by 2, resulting in a 2⁵ increase in N_c. DGPQ requires on average 0.04s to compute its policy at each step, which is within the 10Hz command frequency required by the simulator. While the maximum computation time for DGPQ was 0.1s, the simulations were run in MATLAB, so further optimization should be possible. Figure 3 shows the average reward of DGPQ in this domain using L_Q ∈ {5, 10}. We also ran C-PACE in this domain, but its computation time reached over 60s per step in the first episode, well beyond the desired 10Hz command rate.

[Figure 3. Average (10 runs) reward per episode on the F-16 domain.]

7. Conclusions

This paper provides sample efficiency results for using GPs in RL.
In Section 3, GPs are proven to be usable in the KWIK-Rmax MBRL architecture, establishing the previously proposed algorithm GP-Rmax as PAC-MDP. In Section 4.1, we prove that existing model-free algorithms using a single GP have exponential sample complexity, connecting to seemingly unrelated negative results on Q-learning learning speeds. Finally, the development of DGPQ provides the first provably sample-efficient model-free (without a planner or fixed-point computation) RL algorithm for general continuous spaces.

Acknowledgments

We thank Girish Chowdhary for helpful discussions, Hassan Kingravi for his GP implementation, and Aerojet-Rocketdyne and ONR #N for funding.

References

Brafman, Ronen I. and Tennenholtz, Moshe. R-MAX - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213-231, 2002.

Chowdhary, Girish, Liu, Miao, Grande, Robert C., Walsh, Thomas J., How, Jonathan P., and Carin, Lawrence. Off-policy reinforcement learning with Gaussian processes. Acta Automatica Sinica, to appear, 2014.

Chung, Jen Jen, Lawrance, Nicholas R. J., and Sukkarieh, Salah. Gaussian processes for informative exploration in reinforcement learning. In Proceedings of the IEEE International Conference on Robotics and Automation, 2013.

Csató, L. and Opper, M. Sparse on-line Gaussian processes. Neural Computation, 14(3):641-668, 2002.

Deisenroth, Marc Peter and Rasmussen, Carl Edward. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the International Conference on Machine Learning (ICML), 2011.

Engel, Y., Mannor, S., and Meir, R. Reinforcement learning with Gaussian processes. In Proceedings of the International Conference on Machine Learning (ICML), 2005.

Even-Dar, Eyal and Mansour, Yishay. Learning rates for Q-learning. Journal of Machine Learning Research, 5:1-25, 2004.

Jung, Tobias and Stone, Peter. Gaussian processes for sample efficient reinforcement learning with RMAX-like exploration. In Proceedings of the European Conference on Machine Learning (ECML), 2010.

Kalyanakrishnan, Shivaram and Stone, Peter. An empirical analysis of value function-based and policy search reinforcement learning. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, 2009.

Li, Lihong and Littman, Michael L. Reducing reinforcement learning to KWIK online regression. Annals of Mathematics and Artificial Intelligence, 58(3-4):217-237, 2010.

Li, Lihong, Littman, Michael L., Walsh, Thomas J., and Strehl, Alexander L. Knows what it knows: a framework for self-aware learning. Machine Learning, 82(3):399-443, 2011.

Pazis, Jason and Parr, Ronald. PAC optimal exploration in continuous space Markov decision processes. In Proceedings of the AAAI Conference on Artificial Intelligence, 2013.

Puterman, Martin L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York, 1994.

Rasmussen, C. and Williams, C. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.

Stevens, Brian L. and Lewis, Frank L. Aircraft Control and Simulation. Wiley-Interscience, 2003.

Strehl, Alexander L., Li, Lihong, Wiewiora, Eric, Langford, John, and Littman, Michael L. PAC model-free reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), pp. 881-888, 2006.

Strehl, Alexander L., Li, Lihong, and Littman, Michael L. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10:2413-2444, 2009.

Sutton, R. and Barto, A. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

Watkins, C. J. Q-learning. Machine Learning, 8(3):279-292, 1992.

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING AN INTRODUCTION Prof. Dr. Ann Nowé Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING WHAT IS IT? What is it? Learning from interaction Learning about, from, and while

More information

Journal of Computer and System Sciences. An analysis of model-based Interval Estimation for Markov Decision Processes

Journal of Computer and System Sciences. An analysis of model-based Interval Estimation for Markov Decision Processes Journal of Computer and System Sciences 74 (2008) 1309 1331 Contents lists available at ScienceDirect Journal of Computer and System Sciences www.elsevier.com/locate/jcss An analysis of model-based Interval

More information

Reinforcement Learning. Machine Learning, Fall 2010

Reinforcement Learning. Machine Learning, Fall 2010 Reinforcement Learning Machine Learning, Fall 2010 1 Administrativia This week: finish RL, most likely start graphical models LA2: due on Thursday LA3: comes out on Thursday TA Office hours: Today 1:30-2:30

More information

Basics of reinforcement learning

Basics of reinforcement learning Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system

More information

Model-Based Reinforcement Learning with Continuous States and Actions

Model-Based Reinforcement Learning with Continuous States and Actions Marc P. Deisenroth, Carl E. Rasmussen, and Jan Peters: Model-Based Reinforcement Learning with Continuous States and Actions in Proceedings of the 16th European Symposium on Artificial Neural Networks

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Introduction to Reinforcement Learning

Introduction to Reinforcement Learning CSCI-699: Advanced Topics in Deep Learning 01/16/2019 Nitin Kamra Spring 2019 Introduction to Reinforcement Learning 1 What is Reinforcement Learning? So far we have seen unsupervised and supervised learning.

More information

Markov Decision Processes and Solving Finite Problems. February 8, 2017

Markov Decision Processes and Solving Finite Problems. February 8, 2017 Markov Decision Processes and Solving Finite Problems February 8, 2017 Overview of Upcoming Lectures Feb 8: Markov decision processes, value iteration, policy iteration Feb 13: Policy gradients Feb 15:

More information

Reinforcement Learning and NLP

Reinforcement Learning and NLP 1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value

More information

Internet Monetization

Internet Monetization Internet Monetization March May, 2013 Discrete time Finite A decision process (MDP) is reward process with decisions. It models an environment in which all states are and time is divided into stages. Definition

More information

Reinforcement Learning II

Reinforcement Learning II Reinforcement Learning II Andrea Bonarini Artificial Intelligence and Robotics Lab Department of Electronics and Information Politecnico di Milano E-mail: bonarini@elet.polimi.it URL:http://www.dei.polimi.it/people/bonarini

More information

Reinforcement Learning Active Learning

Reinforcement Learning Active Learning Reinforcement Learning Active Learning Alan Fern * Based in part on slides by Daniel Weld 1 Active Reinforcement Learning So far, we ve assumed agent has a policy We just learned how good it is Now, suppose

More information

Grundlagen der Künstlichen Intelligenz
