arxiv: v1 [cs.lg] 1 Jan 2019

Size: px
Start display at page:

Download "arxiv: v1 [cs.lg] 1 Jan 2019"

Transcription

1 Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds arxiv: v1 [cs.lg] 1 Jan 2019 Andrea Zanette Institute for Computational and Mathematical Engineering, Stanford University Abstract Strong worst-case performance bounds for episodic reinforcement learning exist but fortunately in practice RL algorithms perform much better than such bounds would predict. Algorithms and theory that provide strong problem-dependent bounds could help illuminate the key features of what makes a RL problem hard and reduce the barrier to using RL algorithms in practice. As a step towards this we derive an algorithm for finite horizon discrete MDPs and associated analysis that both yields state-of-the art worst-case regret bounds in the dominant terms and yields substantially tighter bounds if the RL environment has small environmental norm, which is a function of the variance of the nextstate value functions. An important benefit of our algorithmic is that it does not require apriori knowledge of a bound on the environmental norm. As a result of our analysis, we also help address an open learning theory question [JA18] about episodic MDPs with a constant upper-bound on the sum of rewards, providing a regret bound with no H-dependence in the leading term that scales a polynomial function of the number of episodes. 1 Introduction Emma Brunskill Department of Computer Science, Stanford University In reinforcement learning RL an agent must learn how to make good decision without having access to an exact model of the world. Most of the literature for provably efficient exploration in Markov decision processes MDPs [JOA10, OVR13, LH14, DB15, DLB17, OR17, AOM17, KWY18] has focused on providing near-optimal worst-case performance bounds: such bounds are highly desirable as they do not depend on the structure of the particular environment considered and therefore hold for even extremely hardto-learn MDPs. Fortunately in practice reinforcement learning algorithms often perform far better than what these problem-independent bounds would suggest. While we may observe better or worse performance empirically on different MDPs, we would like to derive a more systematic understanding of what types of decision processes are inherently easier or more challenging for RL. This motivates our interest in deriving algorithms and theoretical analyses that provide problemdependent bounds. Ideally, such algorithms will do as well as RL solutions designed for the worst case if the problem is pathologically difficult and otherwise match the performance bounds of algorithms specifically designed for a particular problem subclass. This exciting scenario might bring considerable saving in the time spent designing domain-specific RL solutions and in training a human expert to judge and recognize the complexity of different problems. An added benefit would include the robustness of the RL solution in case the actual model does not belong to the identified subclass, yielding increased confidence to deploying RL to high-stakes applications. Towards this goal, in this paper we contribute a new algorithm for episodic tabular reinforcement learning which automatically provides provably stronger regret bounds in many domains which have a smaller environmental norm [MMM14], which is a characterization of the variance of the optimal value function. Indeed, there is good reason to believe that some features of the range or variability of the optimal value function should be a critical aspect of the hardness of reinforcement learning in a MDP. Many worst-case bounds for

2 finite-state MDPs scale with a worst case bound on the range / magnitude of the value function. For example, Ucrl2 from [JOA10] and a recently proposed algorithm from [AOM17] enjoy a regret bound 1 of Õ DS AT Undiscounted [JOA10] Õ HSAT Fixed Horizon [AOM17] 1 where D is the diameter for an infinite-horizon setting and H, the horizon in an episodic problem, respecctively. Note that here both D and H arise in the analyses as upper bounds on the range of the optimistic value function across the entire MDP. In fact, many RL algorithms with strong performance bounds rely on the principle of optimism under uncertainty and compute an optimistic value function. As more samples are collected, one would hope that the agent s optimistic value function converges to the true optimal value function. Unfortunately this is not the case, see for example [JOA10, BT09, ZB18] for a discussion of this. As a result, most prior analyses bounded the optimistic value function by generic quantities like D or H regardless of the actual behaviour of the optimal value function. While the majority of formal performance guarantees has focused on bounds for the worst case, there have been several contributions of algorithms and/or theoretical analysis focused on MDPs with particular structure. One quantity of interest is the range of the optimal value function, and prior work [BT09] proposed an algorithm that can replace the diameter D with an overestimate of range of the optimal value function across all states. A tractable version this algorithm was subsequently derived [FPLO18]. However, while these approaches provide strong regret bounds that highlight the better performance that can be obtained in domains with a small range in the optimal value function, these approaches assume apriori domain knowledge of the range of the optimal value function to attain improved performance. In contrast, [MMM14] introduced a slightly different quantity, the environmental norm, that depends on the variance of only the optimal value function across successor states. The authors demonstrate this measure of MDP complexity empirically explains the complexity of doing RL in a number of common simulation benchmarks. However, the algorithm and analysis they derive do not enjoy an improved regret bound compared to the available worst-case analyses, nor do they provide performance bounds for specific domain subclasses, so their primary contribution is a good complexity measure that can help empirically explain the different hardness of various RL problems. 1 leading order term only In our paper we build on this promising definition of domain complexity, the environmental norm, but go significantly further to show that we can derive an algorithm for finite horizon discrete MDPs and associated analysis that yields state-of-the art worst-case regret bounds of order Õ HSAT in the leading term all the while improving if the environment has small environmental norm. Compared to the existing literature, our work Maintains state of the art worst-case guarantees [AOM17, KWY18], Improves on the regret bounds of[bt09, FPLO18] when restricted to the episodic setting and without the need of domain knowledge of a bound on the range of the value function, Improves the regret bounds of [ZB18] when deployed in the same settings, Provides the first demonstration that characterizing problems using environmental norm[mmm14] can yield substantially tighter theoretical guarantees, and Identifies significant problem classes with low environmental norm which are of significant interest, including deterministic domains, single-goal MDPs, and high stochasticity domains. Helps address an open learning theory problem [JA18], showing that for their setting, we obtain a regret bound that scales with no dependence on the planning episode horizon in the leading term that is polynomial function of the number of episodes. The paper is organized as follows: we recall some basic definitions in Section 2 and describe the algorithm in Section 3. We state and comment the main result in Section 4, discuss how this helps address an open learning theory problem in Section 5 and then describe selected problem-dependent bounds in Section 6. The analysis is sketched in Section 7. Due to space constraints, most proofs are deferred to the appendix. 2 Preliminaries In this section we introduce some notation and definitions commonly adopted in RL. We consider undiscounted finite horizon MDPs [SB98], which are defined by a tuple M = S,A,p,r,H, where S and A are the state and action spaces with cardinality S and A, respectively. We denote by ps s,a the probability of transitioning to state s after taking action a

3 in state s while rs,a [0,1] is the average instantaneous reward collected. We label with n k s,a the visits to the s,a pair at the beginning of the k-th episode. The agent interacts with the MDP starting from arbitrary initial states in a sequence of episodes k [K] where [K] = {j N : 1 j K} of fixed length H by selecting a policy which maps states s and timesteps t to actions. Each policy identifies a value function for every state s and timestep t [H] defined as V π k H t s t = E s,a πk i=trs,a which is the expected return until the end of the episode the conditional expectation is over the pairs s, a encountered in the MDP upon starting from s t. The optimal policy is indicated with π and its value function as Vt π. We indicate with and, respectively, a pointwise underestimate, respectively, overestimate, of the optimal value function and with ˆp k s,a and ˆr k s,a the MLE estimates of p s,a and rs,a. We focus on deriving a high probability upper bound on the RegretK def = k [K] V1 π s k 1 s k to measure the agent s learning performance. We use the Õ notation to indicate a quantity that depends on up to a polylog expression of a quantity at most polynomial in S,A,T,K,H, 1 δ. We also use the,, notation to mean,,=, respectively, up to a numerical constant and indicate with X 2,p the 2-norm of a random variable 2 under p, i.e., def X 2,p = E p X 2 def = s ps X 2 s if p is its probability mass function. 3 Euler We introduce the algorithm Episodic Upper Lower Exploration in Reinforcement learning Euler which adopts the paradigm of optimism under uncertainty to conduct exploration. Recent work [DB15, DLB17, AOM17, KWY18] has demonstrated how the choice of the exploration bonus is critical to enabling tighter problem-independent performance bounds. In particular, tighter confidence bounds on the expected future rewards, ˆp k s,a p s,a Vt+1, π obtained by leveraging Bernstein s inequality, has lead to significant improvements. However, as the true transition model p s,a and next state values Vt+1 π are unknown, existing algorithms must rely on the empirical estimates of these quantities to construct tight confidence intervals but must do so in a way that preserves optimism which is a critical property of most theoretical analyses. This has given rise to various methods that introduce additional correction terms to ensure optimism. 2 To be precise, this is a norm between classes of random variables that are almost surely the same In our algorithm we use the empirical Bernstein Inequality [MP09] for estimating both the rewards and transition dynamic deviation terms b r k s,a and φp,v in the algorithm, respectively with a different correction term. This allows Euler to preserve in the dominant term the best problem-independent bound Õ HSAT for this setting while simultaneously attaining much stronger problem-dependent bounds than have been previously obtained. The key insight is that by maintaining point-wise consistent upper and lower bounds directly on V π, we can define a bonus term that both guarantees optimism, decays quickly and is a function of the underlying problem. We provide pseudocode for Euler which details the main procedure in figure 1. Notice that Euler has the same computational complexity as value iteration. 4 Main Result Before stating the result, we recall the definition of environmental norm which first appeared in [MMM14]. The definition advanced in [MMM14] is for infinite horizon RL but can very easily be adapted to fixed horizon RL while including the reward variances: Definition 1 Environmental Norm. The environmental norm Q of an MDP is the maximum conditional variance conditioned on the s, a pair of the immediate reward R, random variable and nextstate optimal value function: Q def = max s,a VarRs,a+ Var V π s + ps,a t+1s +. 2 Intuitively, if Q is small then it is not difficult to estimatetheq t valuesfromtheavailablesamplesprovided that Vt+1 π is known. 4.1 Problem Dependent Regret Bound Now we present our main result, which is a problemdependent high-probability regret upper bound for Euler and a worst-case guarantees that matches the established [OV16, JOA10] lower bound of Ω HSAT in the dominant term. Before this, we define the maximum return under any policy and any realization of the rewards and transitions within an episode as G. In other words, G is a deterministic upper bound on the random total reward collected within an episode. Theorem 1 Problem Dependent High Probability Regret Upper Bound for Euler. With probability at least 1 δ the regret of Euler is bounded for any time T KH by the minimum between O polylogs,a,t,1/δ Q SAT + SSAH 2 S +H 3

4 Algorithm 1 Euler for Stationary Episodic MDPs = 1 6 1: Input: failure probability δ, δ def 2: φp,v def = 2 Var pv ln 4SAT δ n k s,a 3: for k = 1,2,... do 4: for t = H,H 1,...,1 do 5: for s S do 6: for a A do 7: b pv k = φˆps,a,v t δ, L def = 2ln12SAT/δ, V H+1 = V H+1 = 0 7H ln 4SAT δ 3n k s,a, b r k s,a = ns,a 2 VarRs,aln 4SAT δ n k s,a 28HL +L V t+1 V 3 n k s,a t+1 ˆps,a + 7ln 4SAT δ 3n k s,a 8: Qa = min{h t,ˆr k s,t+b r k s,a+ ˆp V t+1 +b pv k } 9: end for 10: s,t = argmax a Qa, V t s = Qs, π k s,t 11: b pv k = φˆps, πs,t,v 1 t+1 28HL +L V t+1 V ns,a 3 n k s, s,t t+1 ˆps, πs,t 12: V t s = max{0,ˆr k s, s,t b r k s,s,t+ ˆp V t+1 b pv k } 13: end for 14: end for 15: Evaluate policy and update MLE estimates ˆp, and ˆr, 16: end for and directly implies that Euler can leverage the structure G 2 O polylogs, A, T, 1/δ H SAT + SSAH 2 of a number of commonly encountered MDPs. S +H, Maillard et al. [MMM14] also propose an algorithm 4 that depends on this useful metric of hardness and jointly for all episodes k [K]. show a regret bound whose leading order term is: Since the rewards are in [0,1], we immediately have that G 2 H 2, and obtain the following corollary for the worst-case regret bound: Corollary 1.1. With probability at least 1 δ the regret of Euler is bounded for any time T KH by Õ HSAT + SSAH 2 S +H. 5 Note that this matches in the dominant term the minimax regret problem independent bounds for tabular episodic RL settings [AOM17]. Therefore, the importance of our theorem 1 lies in providing problem dependent bounds equation 3,4 while simultaneously matching the existing best worst case guarantees equation 5. In addition, we shall shortly see that our results also provide tighter results than prior work for a general alternate definition of episodic MDPs that was recently introduced [JA18]. The key quantity that encodes the MDP complexity is Q, the environmental norm of [MMM14]. Notice that the algorithm does not need to be informed of the value of Q. Prior empirical evaluation of Q in [MMM14] has shown in a number of common empirical benchmarks Q has small value: see [MMM14] for a description and references. This empirical evaluation of the environmental norm is encouraging, because it 1 Õ DS Q p AT. 6 0 Unfortunately to our understanding their submission does not improve over worst-case analyses for that setting since the worst-case value function upper bound which is the diameter D in their setting and H in ours is still a factor that shows up in the leading order regret contribution. Said factor is instead replaced altogether by the environmental norm in our analysis. In addition, the inverse of the minimum non-zero transition probability p 0 = min s,s,aps s,a appears in the regret bound. As this can be arbitrarily small, the bound becomes vacuous. Further, their algorithmmustknownp 0. Forthesereasons,theauthors of [MMM14] recognize that their main contribution is mostly about identifying an useful norm and its dual that characterizes the hardness of a number of empirical problems, but the question of whether the environmental norm is a useful quantity from a theoretical point of view was essentially left as open problem. Indeed our contribution can be viewed as a realization of that vision, i.e., proposing an algorithm and analysis which order-wise scale with the environmental norm.

5 5 Planning Horizon Dependence in Dominant Term Before discussing the implications of our above result in several common classes of Markov decision processes with particular different forms of structure, we first show that these new results can help address a recent posed open question in the learning theory community [JA18]. The question posed centers on the whether there should exist a necessary dependence of sample complexity and regret lower bounds on the planning horizon H for episodic tabular MDP reinforcement learning tasks. Existing lower bound results for sample complexity [DB15] do not have a dependence on the horizon, and the best existing minimax regret bounds do have a dependence on the horizon under asymptotic assumptions[aom17]; however, such results have been derived under the common assumption of reward uniformity, that per-time-step rewards are bounded between 0 and 1. Jiang and Agarwal [JA18] instead pose a more general setting, in which they assume that per time step rewards r h 0 and H h=1 r h [0,1] holds almost surely: note the standard setting of reward uniformity can be expressed in this setting by first normalizing all rewards by dividing by H. The authors then ask that if in this new, more general setting of tabular episode RL, if there is necessarily a dependence on the planning horizon in the lower bounds. 9 Note that the planning/episodic horizon H does not appear in the dominant regret term which scales polynomially with the number of episodes K, and only appears in terms that scale logarthmically with the number of episodes K, or as part of transient lower order terms that are independent of K. In other words, up to logarithmic dependency and transient terms, we have an upper bound on episodic regret that is independent of the horizon H. This result then answers part of Jiang and Agarwal s open question: for their setting, we have shown regret scales independently of the horizon H in the dominant term. Notice that our algorithm has no dependence on H that scales as polynomial function of K, and our algorithm does not need to be provided with information about the maximum possible value function. Jiang and Agarwal [JA18] also ask about the dependence of a lower bound on the sample complexity on the planning horizon. While our work focuses on a regret analysis, and does not provide PAC sample complexity results, we can use our regret results to bound with high probability the number of episodes needed to ensure that the average per-episode regret is less than ǫ To do so, we obtain the average per-episode loss of Euler by dividing by the number of episodes K: O polylogs,a,k,h,1/δ SA K + SSAH 2 S +H From here we can seek for the smallest K such that the average error is smaller than ǫ, obtaining: SA SSAH 2 O polylogs,a,k,h,1/δ ǫ + S +H 2 ǫ 11 episodes before the average per-episode error is smaller than ǫ. For small ǫ << SA, the first term dominates, which is again independent of a polynomial dependence on H. Of course, in order to formally obtain a PAC result a worst case upper bound on the number Their assumptions immediately imply that of ǫ-suboptimal episodes the algorithm would need to be modified to be PAC. In particular, the exploration 0 Vt π s G 1, s,t S [H]. 7 bonus for a given s,a pair should be designed so it does not increasewith time T if s,a is not visited. In Further, since Vt π s 1 and rs,a 0 we must practice this means replacing the logt factor of the have rs,a [0,1], which is the assumption of this exploration bonuses with something like logn where work. Therefore our main result theorem 1 applies n is the visit count to a specific state-action pair and here. Recalling that T = KH, we obtain an upper bound on regret as adjusting the numerical constant to make sure the exploration bonuses / confidence intervals are still valid 1 O polylogs,a,k,h,1/δ H SAKH + SSAH 2 with high probability. Please see [DLB17] for a detailed explanation of how to proceed with the algo- S +H rithm design and analysis is this case 8 3. = O polylogs,a,k,h,1/δ SAK + SSAH 2 S +H. It remains an open question whether we could further K 10 avoid either a dependence on in our regret bounds on the planning horizon either in the transient terms or in the log dependencies. However, these results are promising, since they suggest that the hardness of learning in sparse reward, and long horizon episodic MDP environments, may not be fundamentally much harder than shorter horizon domains, as long as the total reward is bounded. We next demonstrate that several classes of episodic MDPs have additional structure that also guarantee 3 Indeed, the algorithm described in [DLB17] is structurally similar to ours.

6 better performancelower regret bounds, even though our algorithm need not be aware of such structure. 6 Problem dependent bounds We now focus on deriving regret bounds for selected MDP classes that are very common in RL. We emphasize that such setting-dependent guarantees are obtained with the same algorithm which is agnostic to the particular MDP at hand and its Q value. Although the described settings share common features and are sometimes subclasses of one another, they are emphasized in separate subsections due to their important relation to the past literature as well as their practical relevance. Importantly, they are all characterized by low environmental norm. 6.1 Bounds using the range of optimal value function To improve over the worst case bound in infinite horizonrltherehavebeenapproachesthataimatobtaining stronger problem dependent bounds if the value function does not vary much across different states of the MDP. If rngv π is smaller than the worst-case either H or D for the fixed horizon vs recurrent RL, the reduced variability in the expected return suggests that we can lower the exploration effort by constructing tighter confidence intervals. This is achieved in [BT09] by providing additional information to their algorithm Regal, achieving a regret bound of order: ÕΦS AT 12 where Φ rngv π is an overestimate of the optimal value function range and is an input to the algorithm described in that paper. This means that if domain knowledge is available and is supplied to the algorithm the regret can be substantially reduced. This line of research was followed in [FPLO18] which derived a computationally-tractable variant of Regal. However, they still require knowledge of a value function rangeupper bound Φ rngv π. Specifying a too high value for Φ would increase the regret and a too low value would cause the algorithm to fail. Although restricted to the episodic setting, our analysis implies that it is possible to achieve at least the same but potentially much better level of performance without knowing the optimal value function range to construct the exploration bonus. This follows as an easy corollary of our main regret upper bound theorem 1 after bounding the environmental norm, as we discuss below. Let S s,a be the set of immediate successor states after one transition from state s upon taking action a there, thatis, thestatesinthesupportofp s,aanddefine Φ succ def = max rng Vt+1s π + 13 s,a s + S s,a as the maximum value function range when restricted to the immediate successor states. Since the variance is upper bounded by one fourth of the square range of a random variable we have that: Q def = max max s,a s,a VarRs,a s,a+ Var [rngrs,a] 2 +[ rng V π s + S s,a V π s + ps,a t+1 s+ t+1 s+ ] 2 1+Φ 2 succ. This immediately yields a result comparable to that of [BT09, FPLO18] when restricted to the fixed horizon framework: Corollary 1.2 Bounded Range of V π Among Successor States. With probability at least 1 δ, the regret of Euler is bounded by: ÕΦ succ SAT + SSAH 2 S +H. 14 Few remarks are in order: the problem independent bound D or H is replaced by a problem dependent quantity as in [BT09, FPLO18] Euler does not need to know the value of Φ succ or of the environmental norm or of the value function range to attain the improved bound Φ succ can be much smaller than rngv π because it is the range of V π restricted to few successor states as opposed to across the whole domain, and therefore it is always smaller than the overestimateφusedin[bt09,fplo18], inotherwords: Φ rngv π Φ succ. 15 [BT09, FPLO18] consider the more challenging infinite horizon setting. Our results holds for fixed horizon RL. It s an interesting research question whether our result can be extended to the recurrent RL setting considered by [BT09, FPLO18]. 6.2 Bounds on the next-state variance of V π and empirical benchmarks As we discuss throughout this section, the environmental norm together with our new result in Theorem 1 can help explain the hardness of subclasses of MDP domains. Before considering another important

7 subclass in the next subsection, we first wish to highlight that the environmental norm also can empirically characterize the hardness of RL in single problem instances. This was one of the nice key contributions of the work that introduced the envirornmental norm [MMM14], which evaluated the environmental norm for a number of common RL benchmark simulation tasks including mountain car, pinball, taxi, bottleneck, inventory and red herring see the table in paragraph 3.2 in [MMM14]. In these domains the environmental norm is nicely correlated with the complexity of reinforcement learning in these environments, as evaluated empirically. Indeed, in these settings, the environmental norm is often much smaller then the maximum value function range, which can itself be much smaller than the worst-case bound D or H. Our new results provide solid theoretical justification for the observed empirical savings. This measure of MDP complexity also intriguingly allows us to gain more insight on another important simulation domain, chain MDPs like that in Figure 1. Chain MDPS have been considered a canonical example of challenging hard-to-learn RL domains, since naive strategies like ǫ greedy can take exponential time to attain satisfactory performance. By setting for simplicity N def = S = H efficient exploration strategies for example [AOM17, KWY18] should incur Õ HSAT = Õ N 2 T regret in the leading order term. In contrast our Euler only incurs leading order term only Õ T leading orderregret. This is intriguing because it reinforces the idea that truly pathological MDPs may be even less common than expected. More details about this example are given in appendix A.1 r = 1 4N 1 N 1 1 N s1 s2 s3 sn 2 sn 1 sn 1 1 N 1 1 N 1 1 N 1 1 N N 1 1 N 1 1 N 1 1 N 1 1 N Figure 1: Classical hard-to-learn MDP 6.3 Stochasticity in the system dynamics In this section we consider two opposite classes of problem, namely deterministic MDPs and MDPs that are highly stochastic in that the successor state is sampled from a fixed distribution. These bounds are also a direct consequence of theorem 1 and can alternatively be deduced from corollary 1.2 but are discussed separately due to their importance. Of course, there exists a wide spectrum of MDPs ranging from deterministic to fully stochastic ones; here for simplicity we consider only the limit cases. 1 1 N r = 1 Deterministic domains Many problems of practical interest, for example in robotics, have low stochasticity, andthisimmediatelyyieldslowvalueforq. As a limit case, we consider domains with deterministic rewards and dynamics models. An agent designed to learn deterministic domains only needs to experience every transition once to reconstruct the model, which can take up to OSA episodes with a regret at most OSAH [WV13]. For Euler an improved bound is straightforwardly derived because Q = 0. Therefore if Euler is run on any deterministic MDP then the regret expression exhibits only a logt dependence. This is a substantial improvement since prior RL regret bounds for problem-independent settings all have at least a T dependence. Further, a refined analysis sec G.3 in appendix shows that Euler is close to the lower bound except for a factor in the horizon and logarithmic terms: Proposition 1. If Euler is run on a deterministic MDP then the regret is bounded by ÕSAH2. The result of the above proposition is important because it supports the idea that Euler can inherit nearoptimal performance even in domains very different from the pathologically hard MDP for which it is designed to do well. Highly mixing domains Recently, [ZB18] show that it is possible to design an algorithm that can switch between the MDP and the contextual bandit framework while retaining near-optimal performance in both without being informed of the setting. They consider mapping contextual bandit to an MDP whose transitions to different states or contexts are sampled from a fixed underlying distribution over which the agent has no control. The Bandit-MDP considered in [ZB18] is an environment with high stochasticity the MDP is highly mixing since every state can be reached with some probability in one step. Since the transition function is unaffected by the agent, an easy computation yields rngvt π 1, as replicated in appendix A.2, and a regret guarantee in the leading order term of order Õ SAT for Euler which matches the established lower bound for tabular contextual bandits [BCB12] follows from corollary 1.2. This is useful since in many practical applications it is unclear in advance if the domain is best modeled as a bandit or a sequential RL problem. Notice that our results improves over [ZB18] since Euler has better worst-case guarantees by a factor of H and most importantly can deal with nextstate distributions that have zero or near zero mass over some of the next states: the inverse minimum visitation probability does not show up in our analy-

8 sis. 7 Sketch of the Theoretical Analysis We devote this section to the sketch of the main point of the regret analysis that yields problem dependent bounds. Central to the analysis is the relation between the agent s optimistic MDP and the true MDP. A more detailed overview of the proof is given in section B of the appendix, while the rest of the appendix presents the detailed analysis under a more general framework. Regret Decomposition Denote with E s,a πk the expectation taken along the trajectories identified by the agent s policy. A standard regret decomposition is given below see [DLB17, AOM17]: RegretK E s,a πk r k s,a rs,a Reward Estimation and Optimism k [K] t [H] s,a S A + p k s,a ˆp k s,a Transition Dynamics Optimism +ˆp k s,a p s,a Vt+1 π Transition Dynamics Estimation +ˆp k s,a p s,a Vπ t+1 Lower Order Term 16 Here, the tilde quantities r and p represent the agent s optimistic estimate. Of the terms in equation 16, the Transition Dynamics Estimation and Transition Dynamics Optimism are the leading terms to bound as far as the regret is concerned. The former is expressed through MDP quantities i.e, the true transition dynamics p s,a and the optimal value function Vt+1 π and hence it can be readily bounded using Bernstein Inequality, giving rise to a problem dependent regret contribution. More challenging is to show that a similar simplification can be obtained for the Transition Dynamics Optimism term which relies on the agent s optimistic estimates p k s,a and. Optimism on the System Dynamics Said term p k s,a ˆp k s,a represents the difference between the agent s imagined i.e., optimistic transition p k s,a and the maximum likelihood transition ˆp k s,a weighted by the next-state optimistic value function. By construction, this is the exploration bonus which ignoring log-factors and constants reads: Transition Dynamics = Exploration Optimism Bonus Dominant Term of Exploration Bonus {}}{ Var s ˆpk s,a n k s,a + H n k s,a } {{ } Empirical Bernstein 17 V + V ˆp k s,a + H nk s,a n k s,a Correction Bonus 18 In the above expression the Correction Bonus is needed to ensure optimism because the Empirical Bernstein contribution is evaluated with the agent s estimate as opposed to the real Vt+1. π If we assume that ˆp k s,a shrinks quickly enough, then the Dominant Term in equation 17 is the most slowly decaying term with a rate 1/ n. If that term involved the true transition dynamics p s,a and value function Vt+1 π as opposed to the agent s estimates ˆp k s,a and then problem dependent bounds would follow in the same way as they could be proved for the Transition Dynamics Estimation. Therefore we wish to study the relation between such Dominant Term evaluated with the agent s MDP estimates vs the MDP s true parameters. Dominant Term of the Exploration Bonus The study just discussed gives the upper bound a: Dominant a Var Term of p s,a Vt+1 π Exploration Bonus n k s,a Gives Problem Dependent Bounds + H n k s,a + V ˆp k s,a nk s,a Shrinks Faster 19 In words, we have decomposed the Dominant Term of the Exploration Bonus which is constructed using the agent s available knowledge as a problemdependent contribution that is equivalent to Bernstein Inequality evaluated as if the model was known and a term that accounts for the distance between the two and shrinks faster that the former. This is precisely the Correction Bonus that we use in equation 17 and in the definition of the Algorithm itself. What gives rise to problem dependent bounds? Our analysis highlights that it is as if Euler was using Bernstein Inequality evaluated with precise domain

9 knowledge of p s,a and Vt+1 π to construct problem dependent confidence intervals with a correction term that guarantees optimism and is rapidly decaying. This correction term ˆp k s,a is a function of the inaccuracy of the value function estimate at the next-step states re-weighted by their relative importance as encoded in the experienced transitions ˆp k s,a. Said correction term is of high value only if the successor states do not have an accurate estimate for the value function and they are going to be visited with high probability. A pigeonhole argument guarantees that this situation cannot happen for too longensuring fast decayof ˆp k s,a and therefore of the whole Correction Bonus of eq. 17. Notice that such considerations and results would not be achievable by a naive application of an Hoeffdinglike inequality as the latter would put equal weight on all successor states, but the accuracy in the estimation of Vt+1 π only shrinks in a way that depends on the visitation frequency of said successor states as encoded in ˆp k s,a. The keyto enableproblem dependent bound is, therefore, to re-weight the importance of the uncertainty on the value function of the successor states by the corresponding visitation probability, which Bernstein Inequality implicitly does. 8 Related Literature Our connection with the most closely connected prior work [BT09, FPLO18, MMM14, ZB18] has been described in the main text. Here we focus on the remaining most relevant literature. Bounds that depend on the MDP mixing property: [AO06, TM18] have bounds on the mixing time but hold for MDPs that are unichain or ergodic. Most MDPs do not possess this property. Bounds function of the gap between policies: [EMM06] has bounds dependent on the minimum gap in the optimal state action values between the best and second best action, and Ucrl2 has bounds as function of the gap in the difference in the average reward between the best and second best policies. As these gaps can be arbitrarily small, the bound approaches infinity; further such bounds do not reflect additional problemdependent structure that we and others have considered here, connected to the value function. Regret bounds with value function approximation: [OR14] uses the Eluder dimension as a measure of the dimensionality of the problem and [JKA + 17] proposes the Bellman rank to measure the learning complexity, but such measures aim at capturing a different notion of hardness than ours and do not match the lower bound in the tabular setting. 9 Future Work and Conclusion In this paper we have proposed Euler, an algorithm for episodic finite state and acton MDPs that matches the best known worst-case regret guarantees while provably obtaining much tighter guarantees if the domain has a small variance of the value function over the next-state distribution, previously defined as environmental norm [MMM14]. A key feature is that our Euler does not need to know any bound on the domain environmental norm in advance. In contrast to prior problem-dependent bounds that relied on value function overestimates Φ of the range of the value function rngv π [BT09, FPLO18] across all states, we consider a tighter quantity: the range of the optimal value function restricted to successor states Φ succ in corollary 1.2 as an upper bound to the environmental norm Q. More precisely: Q Φ succ rngv π Φ 20 implying our problem dependent regret bounds based on the environmental norm are tighter than previous proposals based on the range/span of the value function. We show that the environmental norm is low for a number of important subclasses of MDPS, including: MDPs with sparse rewards, near determinisitic MDPs, highly mixing MDPs such as those closer to bandits and some classical empirical benchmarks. We also show how our result helps answer a recent open learning theory question about the necessary dependence of sample complexity and regret results on the planning/ episode horizon. Interesting directions for future work are to see if our results can also be a function of the gaps between the Q values, which is typical of bandit analyses, as well as extending these ideas to the undiscounted infinite horizon reinforcement learning and to the far more challenging function approximation setting for which even problem independent guarantees have proved to be mostly elusive. References [AMK12] Mohammad Gheshlaghi Azar, Remi Munos, and Hilbert J. Kappen. On the sample complexity of reinforcement learning with a generative model. In ICML, 2012.

10 [AO06] Peter Auer and Ronald Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In NIPS, [AOM17] Mohammad Gheshlaghi Azar, Ian Osband, and Remi Munos. Minimax regret bounds for reinforcement learning. In ICML, [BCB12] Sébastien Bubeck and Nicolò Cesa- Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, [BT09] [DB15] Peter L. Bartlett and Ambuj Tewari. Regal: A regularization based algorithm for reinforcement learning in weakly communicating mdps. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, Christoph Dann and Emma Brunskill. Sample complexity of episodic fixedhorizon reinforcement learning. In NIPS, [DLB17] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. In NIPS, [EMM06] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, [FPLO18] Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained explorationexploitation in reinforcement learning [JA18] Nan Jiang and Alekh Agarwal. Open problem: The dependence of sample complexity lower bounds on planning horizon. In Conference On Learning Theory, pages , [JKA + 17] Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langforda, and Robert E. Schapire. Contextual decision processes with low bellman rank are pac-learnable. In ICML, [JOA10] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, [KWY18] Sham Kakade, Mengdi Wang, and Lin F. Yang. Variance reduction methods for sublinear reinforcement learning. Arxiv, [LH14] Tor Lattimore and Marcus Hutter. Nearoptimal pac bounds for discounted mdps. In Theoretical Computer Science, [MMM14] Odalric-Ambrym Maillard, Timothy A. Mann, and Shie Mannor. how hard is my mdp? the distribution-norm to the rescue. In NIPS, [MP09] [OR14] [OR17] [OV16] Andreas Maurer and Massimiliano Pontil. Empirical bernstein bounds and sample variance penalization. In COLT, Ian Osband and Benjamin Van Roy. Model-based reinforcement learning and the eluder dimension. In NIPS, Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In ICML, Ian Osband and Benjamin Van Roy. On lower bounds for regret in reinforcement learning. In Arxiv, [OVR13] Ian Osband, Benjamin Van Roy, and Daniel Russo. more efficient reinforcement learning via posterior sampling. In NIPS, [SB98] [TM18] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Mohammad Sadegh Talebi and Odalric- Ambrym Maillard. Variance-aware regret bounds for undiscounted reinforcement learning in mdps. In ALT, [WOS + 03] Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J. Weinberger. Inequalities for the l1 deviation of the empirical distribution. Technical report, Hewlett-Packard Labs, [WV13] [ZB18] Zheng Wen and Benjamin Van Roy. Efficient exploration and value function generalization in deterministic systems. In NIPS, Andrea Zanette and Emma Brunskill. Problem dependent reinforcement learning bounds which can identify bandit structure in mdps. In ICML, 2018.

11 Contents 1 Introduction 1 2 Preliminaries 2 3 Euler 3 4 Main Result Problem Dependent Regret Bound Planning Horizon Dependence in Dominant Term 5 6 Problem dependent bounds Bounds using the range of optimal value function Bounds on the next-state variance of V π and empirical benchmarks Stochasticity in the system dynamics Sketch of the Theoretical Analysis 8 8 Related Literature 9 9 Future Work and Conclusion 9 A Short Proofs Missing from the Main Text 12 A.1 Euler on Chain A.2 Euler on Tabular Contextual Bandits B Appendix Overview and Proof Preview 13 B.1 Algorithm B.2 Optimism B.3 Confidence Interval for the Value Function B.4 Regret Bound C Failure Events and their Probabilities 16 C.1 Empirical Bernstein Inequality for the Rewards C.2 Admissible Confidence Intervals on the Transition Dynamics C.3 Bernstein s Inequality C.4 Other Failure Events and Their Probabilities D Optimism 19 D.1 Rewards D.2 Transition Dynamics

12 D.3 Algorithm is Optimistic E Delta Optimism 22 F The Good Set L tk 24 G Regret Analysis 25 G.1 Main Result G.2 Regret Bounds with Bernstein Inequality G.3 Regret Bound in Deterministic Domain with Bernstein Inequality G.4 Reward Estimation and Optimism G.5 Transition Dynamics Estimation G.6 Transition Dynamics Optimism G.7 Lower Order Term G.8 Auxiliary Lemmas A Short Proofs Missing from the Main Text A.1 Euler on Chain Chain MDPs are commonly given as examples of challenging exploration domains because simple strategies like ǫ-greedy can take an exponential time to learn. We now discuss an intriguing result for the chain shown in figure 1 which is nearly identical to a previously introduced one [OR17]. Like in that domain, each episode is N = H = S timesteps long. The optimal policy is to go right which yields a reward of 1 for the episode. The transition probabilities are assigned in way that scales the optimal value function so that it is of order 1. Since the rewards are deterministic we immediately obtain up to a constant that does not depend on N: Q maxvarqs,a s,a max Var Vt+1s π + s,a 1 s,a,t s,a,t N 1 1 N 1 N since Vt+1s π + is essentially dominated by a Bernoulli random variable with success parameter 1 1/N times an appropriate scaling factor of order one. Therefore Euler s regret 4 is dominated by a term which is Õ Q SAT = Õ 1 T N N 2 T = Õ. Notice that the lower order term should be added to the above expression and this is likely to be significant particularly for small T. However, we remark that the result above follows directly from Theorem 1 whose proof does not attempt to make the lower order term problem-dependent. Our result is substantially smaller than the typically reported bounds for this case, which are dominated by a Õ HSAT term. There are two main factors that lead to the above simplification for this class of MDPs: V π is of order 1 and not N = H like in hard-to-learn MDPs which yields the lower bounds [DB15] and also the variance decreases as we let N increase, each of which removes a factor of N from the known worst-case bound Õ HSAT = Õ N 2 AT after substituting N = S = H. To enable Euler s level of performance on this and other MDPs one needs to carefully control both the confidence intervals of the rewards and of the transition probabilities as Euler does, since Hoeffding-like concentration inequalities for the rewards alone would already induce a Θ SAT = Θ NT contribution to the regret expression. This result is intriguing because it suggests that truly pathological MDP classes which induce Ω HSAT regret are even more uncommon than previously thought. 4 This is valid for any class of MDP which shares these properties, implying that the regret for the MDP in figure 1 can be even smaller 21

13 A.2 Euler on Tabular Contextual Bandits Contextual multi-armed bandits are a generalization of the multiarmed bandit problem, see for example [BCB12] for a survey. In their simplest possible formulation they entail a discrete set of contexts or states {s = 1,2,...} and actions {a = 1,2,...} and the expected reward rs,a depends on both the state and action. After playing an action, the agent transitions to the next states according to some fixed distribution µ R S over which the agent has no control. In principle, such problem can be recast as an MDP in which the next state is independent of the prior state and action. Consider an H-horizon MDP which maps to a tabular contextual bandit problem: the transition probability is identical ps s,a = µs for all states and actions, where µ is a fixed distribution from which the next states are sampled. In such MDP Define the best and worst context at time t, respectively: def s t = argmaxvt def s and s t = argminvt s and recall that the transition dynamics p s,a = µ depends nor on the action a nor on the current state s. We have that: { V t s t = max a rst,a+µ Vt+1 Vt s t = max a rst,a+µ Vt+1 22 which immediately yields a bound on the range of the value function of the successor states: rngv π t+1 = V t s t V t s t = max a rs t,a maxrs a t,a 1 23 where the last inequality follows from the fact that rewards are bounded r, [0,1]. Therefore Φ 1 and corollary 1.2 yields a high probability regret upper bound of order Õ SAT + SSAH 2 S +H 24 for Euler. This means that Euler automatically attains the lower bound in the dominant term for tabular contextual bandits [BCB12] if deployed in such setting. We did not try to improve the lower order terms for this specific setting, which may give an improved bound. B Appendix Overview and Proof Preview We start by giving an overview of the proof that leads to the main result. The setting considered in the appendix is more general than the one presented in the main text. In particular we 1 define a class of concentration inequalities for the transition dynamics 2 show that Euler achieves strong problem dependent regret bounds with any concentration inequality satisfying these assumptions. In particular, the main result presented in the main text follows as a corollary of the potentially more general analysis presented in the appendix. We now give a preview of the proof of the main result, assuming rewards are known. We start be recalling Euler with yet-to-specify confidence intervals on the transition dynamcs. B.1 Algorithm The algorithm is presented in figure 2. B.2 Optimism The goal of this section is to show that Euler guarantees optimism with high probability. As is well known, optimism is the driver of exploration as it allows to overestimate the regret by a concentration term with high probability. Let s consider the planning process at the beginning of episode k. This is detailed in lines 4 to 16 of algorithm 1. In order to guarantee finding a pointwise optimistic value function tk V t π a bonus is added to account for bad luck in the system dynamics experienced up to episode k. If the agent knew the value of the confidence intervalfor the system dynamicsφp s,a,vt+1 π then optimism could be inductively guaranteedi.e., assuming that, by induction, V t+1 π holds pointwise if said confidence interval holds: tk = rs,a+ ˆp k s,a π +φp s,a,vt+1 25

14 Algorithm 2 Euler for Stationary Episodic MDPs 1: Input: confidence interval b r k, φ, with failure probability δ, constants B p,b v. 2: ns,a = r sum s,a = p sum s,s,a = 0, s,a S A; V H+1 s = 0, s S 3: for k = 1,2,... do 4: for t = H,H 1,...,1 do 5: for s S do 6: for a A do 7: ˆp = psum,s,a ns,a 8: b pv k = φˆps,a,v 1 4J+B t+1+ p +B ns,a nk v V t+1 V s,a t+1 2,ˆp 9: Qa = min{h t,ˆr k s, s,t+b r k s,a+ ˆp V t+1 +b pv k } 10: end for 11: s,t = argmax a Qa 12: V t s = Q s,t 13: b pv k = φˆps, π 1 4J+B ks,t,v t+1 p ns, πk s,t nk v V t+1 V s, s,t t+1 2,ˆp 14: V t s = max{0,ˆr k s, s,t b r k s,s,t+ ˆp V t+1 b pv k } 15: end for 16: end for 17: s 1 p 0 18: for t = 1,...H do 19: a t = s t,t; r t p R s t,a t ; s t+1 p P s t,a t 20: ns t,a t ++; p sum s t+1,s t,a t ++ 21: end for 22: end for rs,a+ ˆp k s,a Vt+1 π π +φp s,a,vt+1 rs,a+p s,a Vt+1 π V t+1 π 26 If the above conclusion is true for every action then it is true for the maximizer as well. Unfortunately the agent knows nor the real transition dynamics nor the optimal value function to evaluate φ. Instead, it only has access to the estimated transition dynamics ˆp k s,a and to an overestimate of the value function. Unfortunately the confidence interval φ evaluated with such quantities φˆp k s,a, is not guaranteed to overestimate φp s,a,vt+1 π and optimism may not be guaranteed. To remedy this the agent can try to estimate the difference φˆp k s,a, π φp s,a,vt+1 27 and add a correction term to account for that difference. A similar problem is faced in [AOM17] where the authors propose an optimistic bonus which guarantees optimism when using the empirical Bernstein Inequality. By distinction, our way of constructing the bonus works with any concentration inequality satisfying assumption 1 and 2. Precisely, optimism is dealt with in section D in the appendix; in lemma 4 we show how to bound equation 27 obtaining the result below: φˆp k s,a,v φp s,a,vt+1 π B v V V π nk s,a t+1 2,ˆp + B p +4J n k s,a. 28 This is essentially a consequence of the definition of admissible bonus, i.e., def 3. Unfortunately the upper bound in equation 28 depends on Vt+1 π which is not known, so the problem is still unsolved. However, as we show in lemma 5 in the appendix it suffices to pointwise overestimate Vt+1. π To this aim, the algorithm maintains an underestimate of Vt+1 π which we call. Equipped with this underestimate, we define the Exploration Bonus in definition 5 in the appendix, which we report below: b pv k ˆp k s,a,, def = φˆp k s,a, + 4J +B p n k s,a + B v V 2,ˆp nk s,a. 29

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Lecture 3: RL problems, sample complexity and regret Alexandre Proutiere, Sadegh Talebi, Jungseul Ok KTH, The Royal Institute of Technology Objectives of this lecture Introduce the

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Lecture 6: RL algorithms 2.0 Alexandre Proutiere, Sadegh Talebi, Jungseul Ok KTH, The Royal Institute of Technology Objectives of this lecture Present and analyse two online algorithms

More information

MDP Preliminaries. Nan Jiang. February 10, 2019

MDP Preliminaries. Nan Jiang. February 10, 2019 MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process

More information

Improving PAC Exploration Using the Median Of Means

Improving PAC Exploration Using the Median Of Means Improving PAC Exploration Using the Median Of Means Jason Pazis Laboratory for Information and Decision Systems Massachusetts Institute of Technology Cambridge, MA 039, USA jpazis@mit.edu Ronald Parr Department

More information

Exploration. 2015/10/12 John Schulman

Exploration. 2015/10/12 John Schulman Exploration 2015/10/12 John Schulman What is the exploration problem? Given a long-lived agent (or long-running learning algorithm), how to balance exploration and exploitation to maximize long-term rewards

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Model-Based Reinforcement Learning Model-based, PAC-MDP, sample complexity, exploration/exploitation, RMAX, E3, Bayes-optimal, Bayesian RL, model learning Vien Ngo MLR, University

More information

An Analysis of Model-Based Interval Estimation for Markov Decision Processes

An Analysis of Model-Based Interval Estimation for Markov Decision Processes An Analysis of Model-Based Interval Estimation for Markov Decision Processes Alexander L. Strehl, Michael L. Littman astrehl@gmail.com, mlittman@cs.rutgers.edu Computer Science Dept. Rutgers University

More information

Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning

Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning Peter Auer Ronald Ortner University of Leoben, Franz-Josef-Strasse 18, 8700 Leoben, Austria auer,rortner}@unileoben.ac.at Abstract

More information

Selecting the State-Representation in Reinforcement Learning

Selecting the State-Representation in Reinforcement Learning Selecting the State-Representation in Reinforcement Learning Odalric-Ambrym Maillard INRIA Lille - Nord Europe odalricambrym.maillard@gmail.com Rémi Munos INRIA Lille - Nord Europe remi.munos@inria.fr

More information

A Theoretical Analysis of Model-Based Interval Estimation

A Theoretical Analysis of Model-Based Interval Estimation Alexander L. Strehl Michael L. Littman Department of Computer Science, Rutgers University, Piscataway, NJ USA strehl@cs.rutgers.edu mlittman@cs.rutgers.edu Abstract Several algorithms for learning near-optimal

More information

Journal of Computer and System Sciences. An analysis of model-based Interval Estimation for Markov Decision Processes

Journal of Computer and System Sciences. An analysis of model-based Interval Estimation for Markov Decision Processes Journal of Computer and System Sciences 74 (2008) 1309 1331 Contents lists available at ScienceDirect Journal of Computer and System Sciences www.elsevier.com/locate/jcss An analysis of model-based Interval

More information

Experts in a Markov Decision Process

Experts in a Markov Decision Process University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 2004 Experts in a Markov Decision Process Eyal Even-Dar Sham Kakade University of Pennsylvania Yishay Mansour Follow

More information

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)

More information

PAC Model-Free Reinforcement Learning

PAC Model-Free Reinforcement Learning Alexander L. Strehl strehl@cs.rutgers.edu Lihong Li lihong@cs.rutgers.edu Department of Computer Science, Rutgers University, Piscataway, NJ 08854 USA Eric Wiewiora ewiewior@cs.ucsd.edu Computer Science

More information

Optimism in the Face of Uncertainty Should be Refutable

Optimism in the Face of Uncertainty Should be Refutable Optimism in the Face of Uncertainty Should be Refutable Ronald ORTNER Montanuniversität Leoben Department Mathematik und Informationstechnolgie Franz-Josef-Strasse 18, 8700 Leoben, Austria, Phone number:

More information

Administration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon.

Administration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon. Administration CSCI567 Machine Learning Fall 2018 Prof. Haipeng Luo U of Southern California Nov 7, 2018 HW5 is available, due on 11/18. Practice final will also be available soon. Remaining weeks: 11/14,

More information

Prioritized Sweeping Converges to the Optimal Value Function

Prioritized Sweeping Converges to the Optimal Value Function Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science

More information

Practicable Robust Markov Decision Processes

Practicable Robust Markov Decision Processes Practicable Robust Markov Decision Processes Huan Xu Department of Mechanical Engineering National University of Singapore Joint work with Shiau-Hong Lim (IBM), Shie Mannor (Techion), Ofir Mebel (Apple)

More information

Learning Exploration/Exploitation Strategies for Single Trajectory Reinforcement Learning

Learning Exploration/Exploitation Strategies for Single Trajectory Reinforcement Learning JMLR: Workshop and Conference Proceedings vol:1 8, 2012 10th European Workshop on Reinforcement Learning Learning Exploration/Exploitation Strategies for Single Trajectory Reinforcement Learning Michael

More information

Lecture 4: Lower Bounds (ending); Thompson Sampling

Lecture 4: Lower Bounds (ending); Thompson Sampling CMSC 858G: Bandits, Experts and Games 09/12/16 Lecture 4: Lower Bounds (ending); Thompson Sampling Instructor: Alex Slivkins Scribed by: Guowei Sun,Cheng Jie 1 Lower bounds on regret (ending) Recap from

More information

Csaba Szepesvári 1. University of Alberta. Machine Learning Summer School, Ile de Re, France, 2008

Csaba Szepesvári 1. University of Alberta. Machine Learning Summer School, Ile de Re, France, 2008 LEARNING THEORY OF OPTIMAL DECISION MAKING PART I: ON-LINE LEARNING IN STOCHASTIC ENVIRONMENTS Csaba Szepesvári 1 1 Department of Computing Science University of Alberta Machine Learning Summer School,

More information

Online regret in reinforcement learning

Online regret in reinforcement learning University of Leoben, Austria Tübingen, 31 July 2007 Undiscounted online regret I am interested in the difference (in rewards during learning) between an optimal policy and a reinforcement learner: T T

More information

Bias-Variance Error Bounds for Temporal Difference Updates

Bias-Variance Error Bounds for Temporal Difference Updates Bias-Variance Bounds for Temporal Difference Updates Michael Kearns AT&T Labs mkearns@research.att.com Satinder Singh AT&T Labs baveja@research.att.com Abstract We give the first rigorous upper bounds

More information

Efficient Exploration and Value Function Generalization in Deterministic Systems

Efficient Exploration and Value Function Generalization in Deterministic Systems Efficient Exploration and Value Function Generalization in Deterministic Systems Zheng Wen Stanford University zhengwen@stanford.edu Benjamin Van Roy Stanford University bvr@stanford.edu Abstract We consider

More information

Online Regret Bounds for Markov Decision Processes with Deterministic Transitions

Online Regret Bounds for Markov Decision Processes with Deterministic Transitions Online Regret Bounds for Markov Decision Processes with Deterministic Transitions Ronald Ortner Department Mathematik und Informationstechnologie, Montanuniversität Leoben, A-8700 Leoben, Austria Abstract

More information

Q-Learning in Continuous State Action Spaces

Q-Learning in Continuous State Action Spaces Q-Learning in Continuous State Action Spaces Alex Irpan alexirpan@berkeley.edu December 5, 2015 Contents 1 Introduction 1 2 Background 1 3 Q-Learning 2 4 Q-Learning In Continuous Spaces 4 5 Experimental

More information

MS&E338 Reinforcement Learning Lecture 1 - April 2, Introduction

MS&E338 Reinforcement Learning Lecture 1 - April 2, Introduction MS&E338 Reinforcement Learning Lecture 1 - April 2, 2018 Introduction Lecturer: Ben Van Roy Scribe: Gabriel Maher 1 Reinforcement Learning Introduction In reinforcement learning (RL) we consider an agent

More information

(More) Efficient Reinforcement Learning via Posterior Sampling

(More) Efficient Reinforcement Learning via Posterior Sampling (More) Efficient Reinforcement Learning via Posterior Sampling Osband, Ian Stanford University Stanford, CA 94305 iosband@stanford.edu Van Roy, Benjamin Stanford University Stanford, CA 94305 bvr@stanford.edu

More information

Multiple Identifications in Multi-Armed Bandits

Multiple Identifications in Multi-Armed Bandits Multiple Identifications in Multi-Armed Bandits arxiv:05.38v [cs.lg] 4 May 0 Sébastien Bubeck Department of Operations Research and Financial Engineering, Princeton University sbubeck@princeton.edu Tengyao

More information

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation CS234: RL Emma Brunskill Winter 2018 Material builds on structure from David SIlver s Lecture 4: Model-Free

More information

Notes on Tabular Methods

Notes on Tabular Methods Notes on Tabular ethods Nan Jiang September 28, 208 Overview of the methods. Tabular certainty-equivalence Certainty-equivalence is a model-based RL algorithm, that is, it first estimates an DP model from

More information

arxiv: v1 [cs.lg] 12 Feb 2018

arxiv: v1 [cs.lg] 12 Feb 2018 in Reinforcement Learning Ronan Fruit * 1 Matteo Pirotta * 1 Alessandro Lazaric 2 Ronald Ortner 3 arxiv:1802.04020v1 [cs.lg] 12 Feb 2018 Abstract We introduce SCAL, an algorithm designed to perform efficient

More information

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I. Sébastien Bubeck Theory Group

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I. Sébastien Bubeck Theory Group Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I Sébastien Bubeck Theory Group i.i.d. multi-armed bandit, Robbins [1952] i.i.d. multi-armed bandit, Robbins [1952] Known

More information

Reinforcement Learning. Yishay Mansour Tel-Aviv University

Reinforcement Learning. Yishay Mansour Tel-Aviv University Reinforcement Learning Yishay Mansour Tel-Aviv University 1 Reinforcement Learning: Course Information Classes: Wednesday Lecture 10-13 Yishay Mansour Recitations:14-15/15-16 Eliya Nachmani Adam Polyak

More information

Food delivered. Food obtained S 3

Food delivered. Food obtained S 3 Press lever Enter magazine * S 0 Initial state S 1 Food delivered * S 2 No reward S 2 No reward S 3 Food obtained Supplementary Figure 1 Value propagation in tree search, after 50 steps of learning the

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning March May, 2013 Schedule Update Introduction 03/13/2015 (10:15-12:15) Sala conferenze MDPs 03/18/2015 (10:15-12:15) Sala conferenze Solving MDPs 03/20/2015 (10:15-12:15) Aula Alpha

More information

ARTIFICIAL INTELLIGENCE. Reinforcement learning

ARTIFICIAL INTELLIGENCE. Reinforcement learning INFOB2KI 2018-2019 Utrecht University The Netherlands ARTIFICIAL INTELLIGENCE Reinforcement learning Lecturer: Silja Renooij These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

More information

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 12: Reinforcement Learning Teacher: Gianni A. Di Caro REINFORCEMENT LEARNING Transition Model? State Action Reward model? Agent Goal: Maximize expected sum of future rewards 2 MDP PLANNING

More information

CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability

CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability Due: Thursday 10/15 in 283 Soda Drop Box by 11:59pm (no slip days) Policy: Can be solved in groups (acknowledge collaborators)

More information

Lower PAC bound on Upper Confidence Bound-based Q-learning with examples

Lower PAC bound on Upper Confidence Bound-based Q-learning with examples Lower PAC bound on Upper Confidence Bound-based Q-learning with examples Jia-Shen Boon, Xiaomin Zhang University of Wisconsin-Madison {boon,xiaominz}@cs.wisc.edu Abstract Abstract Recently, there has been

More information

Introduction to Reinforcement Learning

Introduction to Reinforcement Learning CSCI-699: Advanced Topics in Deep Learning 01/16/2019 Nitin Kamra Spring 2019 Introduction to Reinforcement Learning 1 What is Reinforcement Learning? So far we have seen unsupervised and supervised learning.

More information

1 MDP Value Iteration Algorithm

1 MDP Value Iteration Algorithm CS 0. - Active Learning Problem Set Handed out: 4 Jan 009 Due: 9 Jan 009 MDP Value Iteration Algorithm. Implement the value iteration algorithm given in the lecture. That is, solve Bellman s equation using

More information

Lecture 1: March 7, 2018

Lecture 1: March 7, 2018 Reinforcement Learning Spring Semester, 2017/8 Lecture 1: March 7, 2018 Lecturer: Yishay Mansour Scribe: ym DISCLAIMER: Based on Learning and Planning in Dynamical Systems by Shie Mannor c, all rights

More information

arxiv: v2 [cs.lg] 1 Dec 2016

arxiv: v2 [cs.lg] 1 Dec 2016 Contextual Decision Processes with Low Bellman Rank are PAC-Learnable Nan Jiang Akshay Krishnamurthy Alekh Agarwal nanjiang@umich.edu akshay@cs.umass.edu alekha@microsoft.com John Langford Robert E. Schapire

More information

Lecture 3: Lower Bounds for Bandit Algorithms

Lecture 3: Lower Bounds for Bandit Algorithms CMSC 858G: Bandits, Experts and Games 09/19/16 Lecture 3: Lower Bounds for Bandit Algorithms Instructor: Alex Slivkins Scribed by: Soham De & Karthik A Sankararaman 1 Lower Bounds In this lecture (and

More information

arxiv: v2 [stat.ml] 17 Jul 2013

arxiv: v2 [stat.ml] 17 Jul 2013 Regret Bounds for Reinforcement Learning with Policy Advice Mohammad Gheshlaghi Azar 1 and Alessandro Lazaric 2 and Emma Brunskill 1 1 Carnegie Mellon University, Pittsburgh, PA, USA {ebrun,mazar}@cs.cmu.edu

More information

COS 402 Machine Learning and Artificial Intelligence Fall Lecture 22. Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3

COS 402 Machine Learning and Artificial Intelligence Fall Lecture 22. Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3 COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 22 Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3 How to balance exploration and exploitation in reinforcement

More information

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam:

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam: Marks } Assignment 1: Should be out this weekend } All are marked, I m trying to tally them and perhaps add bonus points } Mid-term: Before the last lecture } Mid-term deferred exam: } This Saturday, 9am-10.30am,

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Reinforcement learning Daniel Hennes 4.12.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Reinforcement learning Model based and

More information

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning CSE 473: Artificial Intelligence Reinforcement Learning Dan Weld Today s Outline Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Many slides

More information

Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems

Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems Journal of Machine Learning Research (2006 Submitted 2/05; Published 1 Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems Eyal Even-Dar evendar@seas.upenn.edu

More information

Value Directed Exploration in Multi-Armed Bandits with Structured Priors

Value Directed Exploration in Multi-Armed Bandits with Structured Priors Value Directed Exploration in Multi-Armed Bandits with Structured Priors Bence Cserna Marek Petrik Reazul Hasan Russel Wheeler Ruml University of New Hampshire Durham, NH 03824 USA bence, mpetrik, rrussel,

More information

Preference Elicitation for Sequential Decision Problems

Preference Elicitation for Sequential Decision Problems Preference Elicitation for Sequential Decision Problems Kevin Regan University of Toronto Introduction 2 Motivation Focus: Computational approaches to sequential decision making under uncertainty These

More information

Open Theoretical Questions in Reinforcement Learning

Open Theoretical Questions in Reinforcement Learning Open Theoretical Questions in Reinforcement Learning Richard S. Sutton AT&T Labs, Florham Park, NJ 07932, USA, sutton@research.att.com, www.cs.umass.edu/~rich Reinforcement learning (RL) concerns the problem

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Bandit Problems MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Multi-Armed Bandit Problem Problem: which arm of a K-slot machine should a gambler pull to maximize his

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Markov decision process & Dynamic programming Evaluative feedback, value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value

More information

Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm

Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 2778 mgl@cs.duke.edu

More information

Lecture 5: Regret Bounds for Thompson Sampling

Lecture 5: Regret Bounds for Thompson Sampling CMSC 858G: Bandits, Experts and Games 09/2/6 Lecture 5: Regret Bounds for Thompson Sampling Instructor: Alex Slivkins Scribed by: Yancy Liao Regret Bounds for Thompson Sampling For each round t, we defined

More information

CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability

CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability Due: Thursday 10/15 in 283 Soda Drop Box by 11:59pm (no slip days) Policy: Can be solved in groups (acknowledge collaborators)

More information

Distributed Optimization. Song Chong EE, KAIST

Distributed Optimization. Song Chong EE, KAIST Distributed Optimization Song Chong EE, KAIST songchong@kaist.edu Dynamic Programming for Path Planning A path-planning problem consists of a weighted directed graph with a set of n nodes N, directed links

More information

Bayesian Reinforcement Learning with Exploration

Bayesian Reinforcement Learning with Exploration Bayesian Reinforcement Learning with Exploration Tor Lattimore 1 and Marcus Hutter 1 University of Alberta tor.lattimore@gmail.com Australian National University marcus.hutter@anu.edu.au Abstract. We consider

More information

Real Time Value Iteration and the State-Action Value Function

Real Time Value Iteration and the State-Action Value Function MS&E338 Reinforcement Learning Lecture 3-4/9/18 Real Time Value Iteration and the State-Action Value Function Lecturer: Ben Van Roy Scribe: Apoorva Sharma and Tong Mu 1 Review Last time we left off discussing

More information

Selecting Near-Optimal Approximate State Representations in Reinforcement Learning

Selecting Near-Optimal Approximate State Representations in Reinforcement Learning Selecting Near-Optimal Approximate State Representations in Reinforcement Learning Ronald Ortner 1, Odalric-Ambrym Maillard 2, and Daniil Ryabko 3 1 Montanuniversitaet Leoben, Austria 2 The Technion, Israel

More information

Learning to Coordinate Efficiently: A Model-based Approach

Learning to Coordinate Efficiently: A Model-based Approach Journal of Artificial Intelligence Research 19 (2003) 11-23 Submitted 10/02; published 7/03 Learning to Coordinate Efficiently: A Model-based Approach Ronen I. Brafman Computer Science Department Ben-Gurion

More information

A General Overview of Parametric Estimation and Inference Techniques.

A General Overview of Parametric Estimation and Inference Techniques. A General Overview of Parametric Estimation and Inference Techniques. Moulinath Banerjee University of Michigan September 11, 2012 The object of statistical inference is to glean information about an underlying

More information

PAC Subset Selection in Stochastic Multi-armed Bandits

PAC Subset Selection in Stochastic Multi-armed Bandits In Langford, Pineau, editors, Proceedings of the 9th International Conference on Machine Learning, pp 655--66, Omnipress, New York, NY, USA, 0 PAC Subset Selection in Stochastic Multi-armed Bandits Shivaram

More information

Anytime optimal algorithms in stochastic multi-armed bandits

Anytime optimal algorithms in stochastic multi-armed bandits Rémy Degenne LPMA, Université Paris Diderot Vianney Perchet CREST, ENSAE REMYDEGENNE@MATHUNIV-PARIS-DIDEROTFR VIANNEYPERCHET@NORMALESUPORG Abstract We introduce an anytime algorithm for stochastic multi-armed

More information

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it

More information

COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning. Hanna Kurniawati

COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning. Hanna Kurniawati COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning Hanna Kurniawati Today } What is machine learning? } Where is it used? } Types of machine learning

More information

Lecture 3: Markov Decision Processes

Lecture 3: Markov Decision Processes Lecture 3: Markov Decision Processes Joseph Modayil 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs Markov Processes Introduction Introduction to MDPs Markov

More information

A Gentle Introduction to Reinforcement Learning

A Gentle Introduction to Reinforcement Learning A Gentle Introduction to Reinforcement Learning Alexander Jung 2018 1 Introduction and Motivation Consider the cleaning robot Rumba which has to clean the office room B329. In order to keep things simple,

More information

Sample Complexity of Multi-task Reinforcement Learning

Sample Complexity of Multi-task Reinforcement Learning Sample Complexity of Multi-task Reinforcement Learning Emma Brunskill Computer Science Department Carnegie Mellon University Pittsburgh, PA 1513 Lihong Li Microsoft Research One Microsoft Way Redmond,

More information

The Multi-Armed Bandit Problem

The Multi-Armed Bandit Problem The Multi-Armed Bandit Problem Electrical and Computer Engineering December 7, 2013 Outline 1 2 Mathematical 3 Algorithm Upper Confidence Bound Algorithm A/B Testing Exploration vs. Exploitation Scientist

More information

Sparse Linear Contextual Bandits via Relevance Vector Machines

Sparse Linear Contextual Bandits via Relevance Vector Machines Sparse Linear Contextual Bandits via Relevance Vector Machines Davis Gilton and Rebecca Willett Electrical and Computer Engineering University of Wisconsin-Madison Madison, WI 53706 Email: gilton@wisc.edu,

More information

Near-Optimal Time and Sample Complexities for Solving Markov Decision Processes with a Generative Model

Near-Optimal Time and Sample Complexities for Solving Markov Decision Processes with a Generative Model Near-Optimal Time and Sample Complexities for Solving Markov Decision Processes with a Generative Model Aaron Sidford Stanford University sidford@stanford.edu Mengdi Wang Princeton University mengdiw@princeton.edu

More information

Internet Monetization

Internet Monetization Internet Monetization March May, 2013 Discrete time Finite A decision process (MDP) is reward process with decisions. It models an environment in which all states are and time is divided into stages. Definition

More information

Basics of reinforcement learning

Basics of reinforcement learning Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system

More information

Reinforcement Learning II

Reinforcement Learning II Reinforcement Learning II Andrea Bonarini Artificial Intelligence and Robotics Lab Department of Electronics and Information Politecnico di Milano E-mail: bonarini@elet.polimi.it URL:http://www.dei.polimi.it/people/bonarini

More information

Policy Gradient Reinforcement Learning Without Regret

Policy Gradient Reinforcement Learning Without Regret Policy Gradient Reinforcement Learning Without Regret by Travis Dick A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science Department of Computing Science University

More information

Sequential Decision Problems

Sequential Decision Problems Sequential Decision Problems Michael A. Goodrich November 10, 2006 If I make changes to these notes after they are posted and if these changes are important (beyond cosmetic), the changes will highlighted

More information

Q-Learning for Markov Decision Processes*

Q-Learning for Markov Decision Processes* McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan khoa.phan@mail.mcgill.ca Sandeep Manjanna sandeep.manjanna@mail.mcgill.ca (*Based on: Convergence of

More information

MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti

MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti 1 MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti Historical background 2 Original motivation: animal learning Early

More information

A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time

A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time April 16, 2016 Abstract In this exposition we study the E 3 algorithm proposed by Kearns and Singh for reinforcement

More information

Biasing Approximate Dynamic Programming with a Lower Discount Factor

Biasing Approximate Dynamic Programming with a Lower Discount Factor Biasing Approximate Dynamic Programming with a Lower Discount Factor Marek Petrik, Bruno Scherrer To cite this version: Marek Petrik, Bruno Scherrer. Biasing Approximate Dynamic Programming with a Lower

More information

Learning in Zero-Sum Team Markov Games using Factored Value Functions

Learning in Zero-Sum Team Markov Games using Factored Value Functions Learning in Zero-Sum Team Markov Games using Factored Value Functions Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 27708 mgl@cs.duke.edu Ronald Parr Department of Computer

More information

Speedy Q-Learning. Abstract

Speedy Q-Learning. Abstract Speedy Q-Learning Mohammad Gheshlaghi Azar Radboud University Nijmegen Geert Grooteplein N, 6 EZ Nijmegen, Netherlands m.azar@science.ru.nl Mohammad Ghavamzadeh INRIA Lille, SequeL Project 40 avenue Halley

More information

Bellmanian Bandit Network

Bellmanian Bandit Network Bellmanian Bandit Network Antoine Bureau TAO, LRI - INRIA Univ. Paris-Sud bldg 50, Rue Noetzlin, 91190 Gif-sur-Yvette, France antoine.bureau@lri.fr Michèle Sebag TAO, LRI - CNRS Univ. Paris-Sud bldg 50,

More information

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018 Machine learning for image classification Lecture 14: Reinforcement learning May 9, 2018 Page 3 Outline Motivation Introduction to reinforcement learning (RL) Value function based methods (Q-learning)

More information

Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning

Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning Christos Dimitrakakis Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands

More information

Testing Problems with Sub-Learning Sample Complexity

Testing Problems with Sub-Learning Sample Complexity Testing Problems with Sub-Learning Sample Complexity Michael Kearns AT&T Labs Research 180 Park Avenue Florham Park, NJ, 07932 mkearns@researchattcom Dana Ron Laboratory for Computer Science, MIT 545 Technology

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Based on [Kaelbling et al., 1996, Bertsekas, 2000] Bert Kappen Reinforcement learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error

More information

Why is Posterior Sampling Better than Optimism for Reinforcement Learning?

Why is Posterior Sampling Better than Optimism for Reinforcement Learning? Ian Osband 12 Benjamin Van Roy 1 Abstract Computational results demonstrate that posterior sampling for reinforcement learning (PSRL) dramatically outperforms existing algorithms driven by optimism, such

More information

Reinforcement Learning and NLP

Reinforcement Learning and NLP 1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value

More information

Online Learning Schemes for Power Allocation in Energy Harvesting Communications

Online Learning Schemes for Power Allocation in Energy Harvesting Communications Online Learning Schemes for Power Allocation in Energy Harvesting Communications Pranav Sakulkar and Bhaskar Krishnamachari Ming Hsieh Department of Electrical Engineering Viterbi School of Engineering

More information

The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount

The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount Yinyu Ye Department of Management Science and Engineering and Institute of Computational

More information

Pure Exploration Stochastic Multi-armed Bandits

Pure Exploration Stochastic Multi-armed Bandits C&A Workshop 2016, Hangzhou Pure Exploration Stochastic Multi-armed Bandits Jian Li Institute for Interdisciplinary Information Sciences Tsinghua University Outline Introduction 2 Arms Best Arm Identification

More information

CS599 Lecture 1 Introduction To RL

CS599 Lecture 1 Introduction To RL CS599 Lecture 1 Introduction To RL Reinforcement Learning Introduction Learning from rewards Policies Value Functions Rewards Models of the Environment Exploitation vs. Exploration Dynamic Programming

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Ron Parr CompSci 7 Department of Computer Science Duke University With thanks to Kris Hauser for some content RL Highlights Everybody likes to learn from experience Use ML techniques

More information

Why is Posterior Sampling Better than Optimism for Reinforcement Learning?

Why is Posterior Sampling Better than Optimism for Reinforcement Learning? Journal of Machine Learning Research () Submitted ; Published Why is Posterior Sampling Better than Optimism for Reinforcement Learning? Ian Osband Stanford University, Google DeepMind Benjamin Van Roy

More information

Introduction to Reinforcement Learning. CMPT 882 Mar. 18

Introduction to Reinforcement Learning. CMPT 882 Mar. 18 Introduction to Reinforcement Learning CMPT 882 Mar. 18 Outline for the week Basic ideas in RL Value functions and value iteration Policy evaluation and policy improvement Model-free RL Monte-Carlo and

More information