arxiv: v1 [cs.lg] 1 Jan 2019
|
|
- Regina Gardner
- 5 years ago
- Views:
Transcription
1 Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds arxiv: v1 [cs.lg] 1 Jan 2019 Andrea Zanette Institute for Computational and Mathematical Engineering, Stanford University Abstract Strong worst-case performance bounds for episodic reinforcement learning exist but fortunately in practice RL algorithms perform much better than such bounds would predict. Algorithms and theory that provide strong problem-dependent bounds could help illuminate the key features of what makes a RL problem hard and reduce the barrier to using RL algorithms in practice. As a step towards this we derive an algorithm for finite horizon discrete MDPs and associated analysis that both yields state-of-the art worst-case regret bounds in the dominant terms and yields substantially tighter bounds if the RL environment has small environmental norm, which is a function of the variance of the nextstate value functions. An important benefit of our algorithmic is that it does not require apriori knowledge of a bound on the environmental norm. As a result of our analysis, we also help address an open learning theory question [JA18] about episodic MDPs with a constant upper-bound on the sum of rewards, providing a regret bound with no H-dependence in the leading term that scales a polynomial function of the number of episodes. 1 Introduction Emma Brunskill Department of Computer Science, Stanford University In reinforcement learning RL an agent must learn how to make good decision without having access to an exact model of the world. Most of the literature for provably efficient exploration in Markov decision processes MDPs [JOA10, OVR13, LH14, DB15, DLB17, OR17, AOM17, KWY18] has focused on providing near-optimal worst-case performance bounds: such bounds are highly desirable as they do not depend on the structure of the particular environment considered and therefore hold for even extremely hardto-learn MDPs. Fortunately in practice reinforcement learning algorithms often perform far better than what these problem-independent bounds would suggest. While we may observe better or worse performance empirically on different MDPs, we would like to derive a more systematic understanding of what types of decision processes are inherently easier or more challenging for RL. This motivates our interest in deriving algorithms and theoretical analyses that provide problemdependent bounds. Ideally, such algorithms will do as well as RL solutions designed for the worst case if the problem is pathologically difficult and otherwise match the performance bounds of algorithms specifically designed for a particular problem subclass. This exciting scenario might bring considerable saving in the time spent designing domain-specific RL solutions and in training a human expert to judge and recognize the complexity of different problems. An added benefit would include the robustness of the RL solution in case the actual model does not belong to the identified subclass, yielding increased confidence to deploying RL to high-stakes applications. Towards this goal, in this paper we contribute a new algorithm for episodic tabular reinforcement learning which automatically provides provably stronger regret bounds in many domains which have a smaller environmental norm [MMM14], which is a characterization of the variance of the optimal value function. Indeed, there is good reason to believe that some features of the range or variability of the optimal value function should be a critical aspect of the hardness of reinforcement learning in a MDP. Many worst-case bounds for
2 finite-state MDPs scale with a worst case bound on the range / magnitude of the value function. For example, Ucrl2 from [JOA10] and a recently proposed algorithm from [AOM17] enjoy a regret bound 1 of Õ DS AT Undiscounted [JOA10] Õ HSAT Fixed Horizon [AOM17] 1 where D is the diameter for an infinite-horizon setting and H, the horizon in an episodic problem, respecctively. Note that here both D and H arise in the analyses as upper bounds on the range of the optimistic value function across the entire MDP. In fact, many RL algorithms with strong performance bounds rely on the principle of optimism under uncertainty and compute an optimistic value function. As more samples are collected, one would hope that the agent s optimistic value function converges to the true optimal value function. Unfortunately this is not the case, see for example [JOA10, BT09, ZB18] for a discussion of this. As a result, most prior analyses bounded the optimistic value function by generic quantities like D or H regardless of the actual behaviour of the optimal value function. While the majority of formal performance guarantees has focused on bounds for the worst case, there have been several contributions of algorithms and/or theoretical analysis focused on MDPs with particular structure. One quantity of interest is the range of the optimal value function, and prior work [BT09] proposed an algorithm that can replace the diameter D with an overestimate of range of the optimal value function across all states. A tractable version this algorithm was subsequently derived [FPLO18]. However, while these approaches provide strong regret bounds that highlight the better performance that can be obtained in domains with a small range in the optimal value function, these approaches assume apriori domain knowledge of the range of the optimal value function to attain improved performance. In contrast, [MMM14] introduced a slightly different quantity, the environmental norm, that depends on the variance of only the optimal value function across successor states. The authors demonstrate this measure of MDP complexity empirically explains the complexity of doing RL in a number of common simulation benchmarks. However, the algorithm and analysis they derive do not enjoy an improved regret bound compared to the available worst-case analyses, nor do they provide performance bounds for specific domain subclasses, so their primary contribution is a good complexity measure that can help empirically explain the different hardness of various RL problems. 1 leading order term only In our paper we build on this promising definition of domain complexity, the environmental norm, but go significantly further to show that we can derive an algorithm for finite horizon discrete MDPs and associated analysis that yields state-of-the art worst-case regret bounds of order Õ HSAT in the leading term all the while improving if the environment has small environmental norm. Compared to the existing literature, our work Maintains state of the art worst-case guarantees [AOM17, KWY18], Improves on the regret bounds of[bt09, FPLO18] when restricted to the episodic setting and without the need of domain knowledge of a bound on the range of the value function, Improves the regret bounds of [ZB18] when deployed in the same settings, Provides the first demonstration that characterizing problems using environmental norm[mmm14] can yield substantially tighter theoretical guarantees, and Identifies significant problem classes with low environmental norm which are of significant interest, including deterministic domains, single-goal MDPs, and high stochasticity domains. Helps address an open learning theory problem [JA18], showing that for their setting, we obtain a regret bound that scales with no dependence on the planning episode horizon in the leading term that is polynomial function of the number of episodes. The paper is organized as follows: we recall some basic definitions in Section 2 and describe the algorithm in Section 3. We state and comment the main result in Section 4, discuss how this helps address an open learning theory problem in Section 5 and then describe selected problem-dependent bounds in Section 6. The analysis is sketched in Section 7. Due to space constraints, most proofs are deferred to the appendix. 2 Preliminaries In this section we introduce some notation and definitions commonly adopted in RL. We consider undiscounted finite horizon MDPs [SB98], which are defined by a tuple M = S,A,p,r,H, where S and A are the state and action spaces with cardinality S and A, respectively. We denote by ps s,a the probability of transitioning to state s after taking action a
3 in state s while rs,a [0,1] is the average instantaneous reward collected. We label with n k s,a the visits to the s,a pair at the beginning of the k-th episode. The agent interacts with the MDP starting from arbitrary initial states in a sequence of episodes k [K] where [K] = {j N : 1 j K} of fixed length H by selecting a policy which maps states s and timesteps t to actions. Each policy identifies a value function for every state s and timestep t [H] defined as V π k H t s t = E s,a πk i=trs,a which is the expected return until the end of the episode the conditional expectation is over the pairs s, a encountered in the MDP upon starting from s t. The optimal policy is indicated with π and its value function as Vt π. We indicate with and, respectively, a pointwise underestimate, respectively, overestimate, of the optimal value function and with ˆp k s,a and ˆr k s,a the MLE estimates of p s,a and rs,a. We focus on deriving a high probability upper bound on the RegretK def = k [K] V1 π s k 1 s k to measure the agent s learning performance. We use the Õ notation to indicate a quantity that depends on up to a polylog expression of a quantity at most polynomial in S,A,T,K,H, 1 δ. We also use the,, notation to mean,,=, respectively, up to a numerical constant and indicate with X 2,p the 2-norm of a random variable 2 under p, i.e., def X 2,p = E p X 2 def = s ps X 2 s if p is its probability mass function. 3 Euler We introduce the algorithm Episodic Upper Lower Exploration in Reinforcement learning Euler which adopts the paradigm of optimism under uncertainty to conduct exploration. Recent work [DB15, DLB17, AOM17, KWY18] has demonstrated how the choice of the exploration bonus is critical to enabling tighter problem-independent performance bounds. In particular, tighter confidence bounds on the expected future rewards, ˆp k s,a p s,a Vt+1, π obtained by leveraging Bernstein s inequality, has lead to significant improvements. However, as the true transition model p s,a and next state values Vt+1 π are unknown, existing algorithms must rely on the empirical estimates of these quantities to construct tight confidence intervals but must do so in a way that preserves optimism which is a critical property of most theoretical analyses. This has given rise to various methods that introduce additional correction terms to ensure optimism. 2 To be precise, this is a norm between classes of random variables that are almost surely the same In our algorithm we use the empirical Bernstein Inequality [MP09] for estimating both the rewards and transition dynamic deviation terms b r k s,a and φp,v in the algorithm, respectively with a different correction term. This allows Euler to preserve in the dominant term the best problem-independent bound Õ HSAT for this setting while simultaneously attaining much stronger problem-dependent bounds than have been previously obtained. The key insight is that by maintaining point-wise consistent upper and lower bounds directly on V π, we can define a bonus term that both guarantees optimism, decays quickly and is a function of the underlying problem. We provide pseudocode for Euler which details the main procedure in figure 1. Notice that Euler has the same computational complexity as value iteration. 4 Main Result Before stating the result, we recall the definition of environmental norm which first appeared in [MMM14]. The definition advanced in [MMM14] is for infinite horizon RL but can very easily be adapted to fixed horizon RL while including the reward variances: Definition 1 Environmental Norm. The environmental norm Q of an MDP is the maximum conditional variance conditioned on the s, a pair of the immediate reward R, random variable and nextstate optimal value function: Q def = max s,a VarRs,a+ Var V π s + ps,a t+1s +. 2 Intuitively, if Q is small then it is not difficult to estimatetheq t valuesfromtheavailablesamplesprovided that Vt+1 π is known. 4.1 Problem Dependent Regret Bound Now we present our main result, which is a problemdependent high-probability regret upper bound for Euler and a worst-case guarantees that matches the established [OV16, JOA10] lower bound of Ω HSAT in the dominant term. Before this, we define the maximum return under any policy and any realization of the rewards and transitions within an episode as G. In other words, G is a deterministic upper bound on the random total reward collected within an episode. Theorem 1 Problem Dependent High Probability Regret Upper Bound for Euler. With probability at least 1 δ the regret of Euler is bounded for any time T KH by the minimum between O polylogs,a,t,1/δ Q SAT + SSAH 2 S +H 3
4 Algorithm 1 Euler for Stationary Episodic MDPs = 1 6 1: Input: failure probability δ, δ def 2: φp,v def = 2 Var pv ln 4SAT δ n k s,a 3: for k = 1,2,... do 4: for t = H,H 1,...,1 do 5: for s S do 6: for a A do 7: b pv k = φˆps,a,v t δ, L def = 2ln12SAT/δ, V H+1 = V H+1 = 0 7H ln 4SAT δ 3n k s,a, b r k s,a = ns,a 2 VarRs,aln 4SAT δ n k s,a 28HL +L V t+1 V 3 n k s,a t+1 ˆps,a + 7ln 4SAT δ 3n k s,a 8: Qa = min{h t,ˆr k s,t+b r k s,a+ ˆp V t+1 +b pv k } 9: end for 10: s,t = argmax a Qa, V t s = Qs, π k s,t 11: b pv k = φˆps, πs,t,v 1 t+1 28HL +L V t+1 V ns,a 3 n k s, s,t t+1 ˆps, πs,t 12: V t s = max{0,ˆr k s, s,t b r k s,s,t+ ˆp V t+1 b pv k } 13: end for 14: end for 15: Evaluate policy and update MLE estimates ˆp, and ˆr, 16: end for and directly implies that Euler can leverage the structure G 2 O polylogs, A, T, 1/δ H SAT + SSAH 2 of a number of commonly encountered MDPs. S +H, Maillard et al. [MMM14] also propose an algorithm 4 that depends on this useful metric of hardness and jointly for all episodes k [K]. show a regret bound whose leading order term is: Since the rewards are in [0,1], we immediately have that G 2 H 2, and obtain the following corollary for the worst-case regret bound: Corollary 1.1. With probability at least 1 δ the regret of Euler is bounded for any time T KH by Õ HSAT + SSAH 2 S +H. 5 Note that this matches in the dominant term the minimax regret problem independent bounds for tabular episodic RL settings [AOM17]. Therefore, the importance of our theorem 1 lies in providing problem dependent bounds equation 3,4 while simultaneously matching the existing best worst case guarantees equation 5. In addition, we shall shortly see that our results also provide tighter results than prior work for a general alternate definition of episodic MDPs that was recently introduced [JA18]. The key quantity that encodes the MDP complexity is Q, the environmental norm of [MMM14]. Notice that the algorithm does not need to be informed of the value of Q. Prior empirical evaluation of Q in [MMM14] has shown in a number of common empirical benchmarks Q has small value: see [MMM14] for a description and references. This empirical evaluation of the environmental norm is encouraging, because it 1 Õ DS Q p AT. 6 0 Unfortunately to our understanding their submission does not improve over worst-case analyses for that setting since the worst-case value function upper bound which is the diameter D in their setting and H in ours is still a factor that shows up in the leading order regret contribution. Said factor is instead replaced altogether by the environmental norm in our analysis. In addition, the inverse of the minimum non-zero transition probability p 0 = min s,s,aps s,a appears in the regret bound. As this can be arbitrarily small, the bound becomes vacuous. Further, their algorithmmustknownp 0. Forthesereasons,theauthors of [MMM14] recognize that their main contribution is mostly about identifying an useful norm and its dual that characterizes the hardness of a number of empirical problems, but the question of whether the environmental norm is a useful quantity from a theoretical point of view was essentially left as open problem. Indeed our contribution can be viewed as a realization of that vision, i.e., proposing an algorithm and analysis which order-wise scale with the environmental norm.
5 5 Planning Horizon Dependence in Dominant Term Before discussing the implications of our above result in several common classes of Markov decision processes with particular different forms of structure, we first show that these new results can help address a recent posed open question in the learning theory community [JA18]. The question posed centers on the whether there should exist a necessary dependence of sample complexity and regret lower bounds on the planning horizon H for episodic tabular MDP reinforcement learning tasks. Existing lower bound results for sample complexity [DB15] do not have a dependence on the horizon, and the best existing minimax regret bounds do have a dependence on the horizon under asymptotic assumptions[aom17]; however, such results have been derived under the common assumption of reward uniformity, that per-time-step rewards are bounded between 0 and 1. Jiang and Agarwal [JA18] instead pose a more general setting, in which they assume that per time step rewards r h 0 and H h=1 r h [0,1] holds almost surely: note the standard setting of reward uniformity can be expressed in this setting by first normalizing all rewards by dividing by H. The authors then ask that if in this new, more general setting of tabular episode RL, if there is necessarily a dependence on the planning horizon in the lower bounds. 9 Note that the planning/episodic horizon H does not appear in the dominant regret term which scales polynomially with the number of episodes K, and only appears in terms that scale logarthmically with the number of episodes K, or as part of transient lower order terms that are independent of K. In other words, up to logarithmic dependency and transient terms, we have an upper bound on episodic regret that is independent of the horizon H. This result then answers part of Jiang and Agarwal s open question: for their setting, we have shown regret scales independently of the horizon H in the dominant term. Notice that our algorithm has no dependence on H that scales as polynomial function of K, and our algorithm does not need to be provided with information about the maximum possible value function. Jiang and Agarwal [JA18] also ask about the dependence of a lower bound on the sample complexity on the planning horizon. While our work focuses on a regret analysis, and does not provide PAC sample complexity results, we can use our regret results to bound with high probability the number of episodes needed to ensure that the average per-episode regret is less than ǫ To do so, we obtain the average per-episode loss of Euler by dividing by the number of episodes K: O polylogs,a,k,h,1/δ SA K + SSAH 2 S +H From here we can seek for the smallest K such that the average error is smaller than ǫ, obtaining: SA SSAH 2 O polylogs,a,k,h,1/δ ǫ + S +H 2 ǫ 11 episodes before the average per-episode error is smaller than ǫ. For small ǫ << SA, the first term dominates, which is again independent of a polynomial dependence on H. Of course, in order to formally obtain a PAC result a worst case upper bound on the number Their assumptions immediately imply that of ǫ-suboptimal episodes the algorithm would need to be modified to be PAC. In particular, the exploration 0 Vt π s G 1, s,t S [H]. 7 bonus for a given s,a pair should be designed so it does not increasewith time T if s,a is not visited. In Further, since Vt π s 1 and rs,a 0 we must practice this means replacing the logt factor of the have rs,a [0,1], which is the assumption of this exploration bonuses with something like logn where work. Therefore our main result theorem 1 applies n is the visit count to a specific state-action pair and here. Recalling that T = KH, we obtain an upper bound on regret as adjusting the numerical constant to make sure the exploration bonuses / confidence intervals are still valid 1 O polylogs,a,k,h,1/δ H SAKH + SSAH 2 with high probability. Please see [DLB17] for a detailed explanation of how to proceed with the algo- S +H rithm design and analysis is this case 8 3. = O polylogs,a,k,h,1/δ SAK + SSAH 2 S +H. It remains an open question whether we could further K 10 avoid either a dependence on in our regret bounds on the planning horizon either in the transient terms or in the log dependencies. However, these results are promising, since they suggest that the hardness of learning in sparse reward, and long horizon episodic MDP environments, may not be fundamentally much harder than shorter horizon domains, as long as the total reward is bounded. We next demonstrate that several classes of episodic MDPs have additional structure that also guarantee 3 Indeed, the algorithm described in [DLB17] is structurally similar to ours.
6 better performancelower regret bounds, even though our algorithm need not be aware of such structure. 6 Problem dependent bounds We now focus on deriving regret bounds for selected MDP classes that are very common in RL. We emphasize that such setting-dependent guarantees are obtained with the same algorithm which is agnostic to the particular MDP at hand and its Q value. Although the described settings share common features and are sometimes subclasses of one another, they are emphasized in separate subsections due to their important relation to the past literature as well as their practical relevance. Importantly, they are all characterized by low environmental norm. 6.1 Bounds using the range of optimal value function To improve over the worst case bound in infinite horizonrltherehavebeenapproachesthataimatobtaining stronger problem dependent bounds if the value function does not vary much across different states of the MDP. If rngv π is smaller than the worst-case either H or D for the fixed horizon vs recurrent RL, the reduced variability in the expected return suggests that we can lower the exploration effort by constructing tighter confidence intervals. This is achieved in [BT09] by providing additional information to their algorithm Regal, achieving a regret bound of order: ÕΦS AT 12 where Φ rngv π is an overestimate of the optimal value function range and is an input to the algorithm described in that paper. This means that if domain knowledge is available and is supplied to the algorithm the regret can be substantially reduced. This line of research was followed in [FPLO18] which derived a computationally-tractable variant of Regal. However, they still require knowledge of a value function rangeupper bound Φ rngv π. Specifying a too high value for Φ would increase the regret and a too low value would cause the algorithm to fail. Although restricted to the episodic setting, our analysis implies that it is possible to achieve at least the same but potentially much better level of performance without knowing the optimal value function range to construct the exploration bonus. This follows as an easy corollary of our main regret upper bound theorem 1 after bounding the environmental norm, as we discuss below. Let S s,a be the set of immediate successor states after one transition from state s upon taking action a there, thatis, thestatesinthesupportofp s,aanddefine Φ succ def = max rng Vt+1s π + 13 s,a s + S s,a as the maximum value function range when restricted to the immediate successor states. Since the variance is upper bounded by one fourth of the square range of a random variable we have that: Q def = max max s,a s,a VarRs,a s,a+ Var [rngrs,a] 2 +[ rng V π s + S s,a V π s + ps,a t+1 s+ t+1 s+ ] 2 1+Φ 2 succ. This immediately yields a result comparable to that of [BT09, FPLO18] when restricted to the fixed horizon framework: Corollary 1.2 Bounded Range of V π Among Successor States. With probability at least 1 δ, the regret of Euler is bounded by: ÕΦ succ SAT + SSAH 2 S +H. 14 Few remarks are in order: the problem independent bound D or H is replaced by a problem dependent quantity as in [BT09, FPLO18] Euler does not need to know the value of Φ succ or of the environmental norm or of the value function range to attain the improved bound Φ succ can be much smaller than rngv π because it is the range of V π restricted to few successor states as opposed to across the whole domain, and therefore it is always smaller than the overestimateφusedin[bt09,fplo18], inotherwords: Φ rngv π Φ succ. 15 [BT09, FPLO18] consider the more challenging infinite horizon setting. Our results holds for fixed horizon RL. It s an interesting research question whether our result can be extended to the recurrent RL setting considered by [BT09, FPLO18]. 6.2 Bounds on the next-state variance of V π and empirical benchmarks As we discuss throughout this section, the environmental norm together with our new result in Theorem 1 can help explain the hardness of subclasses of MDP domains. Before considering another important
7 subclass in the next subsection, we first wish to highlight that the environmental norm also can empirically characterize the hardness of RL in single problem instances. This was one of the nice key contributions of the work that introduced the envirornmental norm [MMM14], which evaluated the environmental norm for a number of common RL benchmark simulation tasks including mountain car, pinball, taxi, bottleneck, inventory and red herring see the table in paragraph 3.2 in [MMM14]. In these domains the environmental norm is nicely correlated with the complexity of reinforcement learning in these environments, as evaluated empirically. Indeed, in these settings, the environmental norm is often much smaller then the maximum value function range, which can itself be much smaller than the worst-case bound D or H. Our new results provide solid theoretical justification for the observed empirical savings. This measure of MDP complexity also intriguingly allows us to gain more insight on another important simulation domain, chain MDPs like that in Figure 1. Chain MDPS have been considered a canonical example of challenging hard-to-learn RL domains, since naive strategies like ǫ greedy can take exponential time to attain satisfactory performance. By setting for simplicity N def = S = H efficient exploration strategies for example [AOM17, KWY18] should incur Õ HSAT = Õ N 2 T regret in the leading order term. In contrast our Euler only incurs leading order term only Õ T leading orderregret. This is intriguing because it reinforces the idea that truly pathological MDPs may be even less common than expected. More details about this example are given in appendix A.1 r = 1 4N 1 N 1 1 N s1 s2 s3 sn 2 sn 1 sn 1 1 N 1 1 N 1 1 N 1 1 N N 1 1 N 1 1 N 1 1 N 1 1 N Figure 1: Classical hard-to-learn MDP 6.3 Stochasticity in the system dynamics In this section we consider two opposite classes of problem, namely deterministic MDPs and MDPs that are highly stochastic in that the successor state is sampled from a fixed distribution. These bounds are also a direct consequence of theorem 1 and can alternatively be deduced from corollary 1.2 but are discussed separately due to their importance. Of course, there exists a wide spectrum of MDPs ranging from deterministic to fully stochastic ones; here for simplicity we consider only the limit cases. 1 1 N r = 1 Deterministic domains Many problems of practical interest, for example in robotics, have low stochasticity, andthisimmediatelyyieldslowvalueforq. As a limit case, we consider domains with deterministic rewards and dynamics models. An agent designed to learn deterministic domains only needs to experience every transition once to reconstruct the model, which can take up to OSA episodes with a regret at most OSAH [WV13]. For Euler an improved bound is straightforwardly derived because Q = 0. Therefore if Euler is run on any deterministic MDP then the regret expression exhibits only a logt dependence. This is a substantial improvement since prior RL regret bounds for problem-independent settings all have at least a T dependence. Further, a refined analysis sec G.3 in appendix shows that Euler is close to the lower bound except for a factor in the horizon and logarithmic terms: Proposition 1. If Euler is run on a deterministic MDP then the regret is bounded by ÕSAH2. The result of the above proposition is important because it supports the idea that Euler can inherit nearoptimal performance even in domains very different from the pathologically hard MDP for which it is designed to do well. Highly mixing domains Recently, [ZB18] show that it is possible to design an algorithm that can switch between the MDP and the contextual bandit framework while retaining near-optimal performance in both without being informed of the setting. They consider mapping contextual bandit to an MDP whose transitions to different states or contexts are sampled from a fixed underlying distribution over which the agent has no control. The Bandit-MDP considered in [ZB18] is an environment with high stochasticity the MDP is highly mixing since every state can be reached with some probability in one step. Since the transition function is unaffected by the agent, an easy computation yields rngvt π 1, as replicated in appendix A.2, and a regret guarantee in the leading order term of order Õ SAT for Euler which matches the established lower bound for tabular contextual bandits [BCB12] follows from corollary 1.2. This is useful since in many practical applications it is unclear in advance if the domain is best modeled as a bandit or a sequential RL problem. Notice that our results improves over [ZB18] since Euler has better worst-case guarantees by a factor of H and most importantly can deal with nextstate distributions that have zero or near zero mass over some of the next states: the inverse minimum visitation probability does not show up in our analy-
8 sis. 7 Sketch of the Theoretical Analysis We devote this section to the sketch of the main point of the regret analysis that yields problem dependent bounds. Central to the analysis is the relation between the agent s optimistic MDP and the true MDP. A more detailed overview of the proof is given in section B of the appendix, while the rest of the appendix presents the detailed analysis under a more general framework. Regret Decomposition Denote with E s,a πk the expectation taken along the trajectories identified by the agent s policy. A standard regret decomposition is given below see [DLB17, AOM17]: RegretK E s,a πk r k s,a rs,a Reward Estimation and Optimism k [K] t [H] s,a S A + p k s,a ˆp k s,a Transition Dynamics Optimism +ˆp k s,a p s,a Vt+1 π Transition Dynamics Estimation +ˆp k s,a p s,a Vπ t+1 Lower Order Term 16 Here, the tilde quantities r and p represent the agent s optimistic estimate. Of the terms in equation 16, the Transition Dynamics Estimation and Transition Dynamics Optimism are the leading terms to bound as far as the regret is concerned. The former is expressed through MDP quantities i.e, the true transition dynamics p s,a and the optimal value function Vt+1 π and hence it can be readily bounded using Bernstein Inequality, giving rise to a problem dependent regret contribution. More challenging is to show that a similar simplification can be obtained for the Transition Dynamics Optimism term which relies on the agent s optimistic estimates p k s,a and. Optimism on the System Dynamics Said term p k s,a ˆp k s,a represents the difference between the agent s imagined i.e., optimistic transition p k s,a and the maximum likelihood transition ˆp k s,a weighted by the next-state optimistic value function. By construction, this is the exploration bonus which ignoring log-factors and constants reads: Transition Dynamics = Exploration Optimism Bonus Dominant Term of Exploration Bonus {}}{ Var s ˆpk s,a n k s,a + H n k s,a } {{ } Empirical Bernstein 17 V + V ˆp k s,a + H nk s,a n k s,a Correction Bonus 18 In the above expression the Correction Bonus is needed to ensure optimism because the Empirical Bernstein contribution is evaluated with the agent s estimate as opposed to the real Vt+1. π If we assume that ˆp k s,a shrinks quickly enough, then the Dominant Term in equation 17 is the most slowly decaying term with a rate 1/ n. If that term involved the true transition dynamics p s,a and value function Vt+1 π as opposed to the agent s estimates ˆp k s,a and then problem dependent bounds would follow in the same way as they could be proved for the Transition Dynamics Estimation. Therefore we wish to study the relation between such Dominant Term evaluated with the agent s MDP estimates vs the MDP s true parameters. Dominant Term of the Exploration Bonus The study just discussed gives the upper bound a: Dominant a Var Term of p s,a Vt+1 π Exploration Bonus n k s,a Gives Problem Dependent Bounds + H n k s,a + V ˆp k s,a nk s,a Shrinks Faster 19 In words, we have decomposed the Dominant Term of the Exploration Bonus which is constructed using the agent s available knowledge as a problemdependent contribution that is equivalent to Bernstein Inequality evaluated as if the model was known and a term that accounts for the distance between the two and shrinks faster that the former. This is precisely the Correction Bonus that we use in equation 17 and in the definition of the Algorithm itself. What gives rise to problem dependent bounds? Our analysis highlights that it is as if Euler was using Bernstein Inequality evaluated with precise domain
9 knowledge of p s,a and Vt+1 π to construct problem dependent confidence intervals with a correction term that guarantees optimism and is rapidly decaying. This correction term ˆp k s,a is a function of the inaccuracy of the value function estimate at the next-step states re-weighted by their relative importance as encoded in the experienced transitions ˆp k s,a. Said correction term is of high value only if the successor states do not have an accurate estimate for the value function and they are going to be visited with high probability. A pigeonhole argument guarantees that this situation cannot happen for too longensuring fast decayof ˆp k s,a and therefore of the whole Correction Bonus of eq. 17. Notice that such considerations and results would not be achievable by a naive application of an Hoeffdinglike inequality as the latter would put equal weight on all successor states, but the accuracy in the estimation of Vt+1 π only shrinks in a way that depends on the visitation frequency of said successor states as encoded in ˆp k s,a. The keyto enableproblem dependent bound is, therefore, to re-weight the importance of the uncertainty on the value function of the successor states by the corresponding visitation probability, which Bernstein Inequality implicitly does. 8 Related Literature Our connection with the most closely connected prior work [BT09, FPLO18, MMM14, ZB18] has been described in the main text. Here we focus on the remaining most relevant literature. Bounds that depend on the MDP mixing property: [AO06, TM18] have bounds on the mixing time but hold for MDPs that are unichain or ergodic. Most MDPs do not possess this property. Bounds function of the gap between policies: [EMM06] has bounds dependent on the minimum gap in the optimal state action values between the best and second best action, and Ucrl2 has bounds as function of the gap in the difference in the average reward between the best and second best policies. As these gaps can be arbitrarily small, the bound approaches infinity; further such bounds do not reflect additional problemdependent structure that we and others have considered here, connected to the value function. Regret bounds with value function approximation: [OR14] uses the Eluder dimension as a measure of the dimensionality of the problem and [JKA + 17] proposes the Bellman rank to measure the learning complexity, but such measures aim at capturing a different notion of hardness than ours and do not match the lower bound in the tabular setting. 9 Future Work and Conclusion In this paper we have proposed Euler, an algorithm for episodic finite state and acton MDPs that matches the best known worst-case regret guarantees while provably obtaining much tighter guarantees if the domain has a small variance of the value function over the next-state distribution, previously defined as environmental norm [MMM14]. A key feature is that our Euler does not need to know any bound on the domain environmental norm in advance. In contrast to prior problem-dependent bounds that relied on value function overestimates Φ of the range of the value function rngv π [BT09, FPLO18] across all states, we consider a tighter quantity: the range of the optimal value function restricted to successor states Φ succ in corollary 1.2 as an upper bound to the environmental norm Q. More precisely: Q Φ succ rngv π Φ 20 implying our problem dependent regret bounds based on the environmental norm are tighter than previous proposals based on the range/span of the value function. We show that the environmental norm is low for a number of important subclasses of MDPS, including: MDPs with sparse rewards, near determinisitic MDPs, highly mixing MDPs such as those closer to bandits and some classical empirical benchmarks. We also show how our result helps answer a recent open learning theory question about the necessary dependence of sample complexity and regret results on the planning/ episode horizon. Interesting directions for future work are to see if our results can also be a function of the gaps between the Q values, which is typical of bandit analyses, as well as extending these ideas to the undiscounted infinite horizon reinforcement learning and to the far more challenging function approximation setting for which even problem independent guarantees have proved to be mostly elusive. References [AMK12] Mohammad Gheshlaghi Azar, Remi Munos, and Hilbert J. Kappen. On the sample complexity of reinforcement learning with a generative model. In ICML, 2012.
10 [AO06] Peter Auer and Ronald Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In NIPS, [AOM17] Mohammad Gheshlaghi Azar, Ian Osband, and Remi Munos. Minimax regret bounds for reinforcement learning. In ICML, [BCB12] Sébastien Bubeck and Nicolò Cesa- Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, [BT09] [DB15] Peter L. Bartlett and Ambuj Tewari. Regal: A regularization based algorithm for reinforcement learning in weakly communicating mdps. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, Christoph Dann and Emma Brunskill. Sample complexity of episodic fixedhorizon reinforcement learning. In NIPS, [DLB17] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. In NIPS, [EMM06] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, [FPLO18] Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained explorationexploitation in reinforcement learning [JA18] Nan Jiang and Alekh Agarwal. Open problem: The dependence of sample complexity lower bounds on planning horizon. In Conference On Learning Theory, pages , [JKA + 17] Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langforda, and Robert E. Schapire. Contextual decision processes with low bellman rank are pac-learnable. In ICML, [JOA10] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, [KWY18] Sham Kakade, Mengdi Wang, and Lin F. Yang. Variance reduction methods for sublinear reinforcement learning. Arxiv, [LH14] Tor Lattimore and Marcus Hutter. Nearoptimal pac bounds for discounted mdps. In Theoretical Computer Science, [MMM14] Odalric-Ambrym Maillard, Timothy A. Mann, and Shie Mannor. how hard is my mdp? the distribution-norm to the rescue. In NIPS, [MP09] [OR14] [OR17] [OV16] Andreas Maurer and Massimiliano Pontil. Empirical bernstein bounds and sample variance penalization. In COLT, Ian Osband and Benjamin Van Roy. Model-based reinforcement learning and the eluder dimension. In NIPS, Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In ICML, Ian Osband and Benjamin Van Roy. On lower bounds for regret in reinforcement learning. In Arxiv, [OVR13] Ian Osband, Benjamin Van Roy, and Daniel Russo. more efficient reinforcement learning via posterior sampling. In NIPS, [SB98] [TM18] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Mohammad Sadegh Talebi and Odalric- Ambrym Maillard. Variance-aware regret bounds for undiscounted reinforcement learning in mdps. In ALT, [WOS + 03] Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J. Weinberger. Inequalities for the l1 deviation of the empirical distribution. Technical report, Hewlett-Packard Labs, [WV13] [ZB18] Zheng Wen and Benjamin Van Roy. Efficient exploration and value function generalization in deterministic systems. In NIPS, Andrea Zanette and Emma Brunskill. Problem dependent reinforcement learning bounds which can identify bandit structure in mdps. In ICML, 2018.
11 Contents 1 Introduction 1 2 Preliminaries 2 3 Euler 3 4 Main Result Problem Dependent Regret Bound Planning Horizon Dependence in Dominant Term 5 6 Problem dependent bounds Bounds using the range of optimal value function Bounds on the next-state variance of V π and empirical benchmarks Stochasticity in the system dynamics Sketch of the Theoretical Analysis 8 8 Related Literature 9 9 Future Work and Conclusion 9 A Short Proofs Missing from the Main Text 12 A.1 Euler on Chain A.2 Euler on Tabular Contextual Bandits B Appendix Overview and Proof Preview 13 B.1 Algorithm B.2 Optimism B.3 Confidence Interval for the Value Function B.4 Regret Bound C Failure Events and their Probabilities 16 C.1 Empirical Bernstein Inequality for the Rewards C.2 Admissible Confidence Intervals on the Transition Dynamics C.3 Bernstein s Inequality C.4 Other Failure Events and Their Probabilities D Optimism 19 D.1 Rewards D.2 Transition Dynamics
12 D.3 Algorithm is Optimistic E Delta Optimism 22 F The Good Set L tk 24 G Regret Analysis 25 G.1 Main Result G.2 Regret Bounds with Bernstein Inequality G.3 Regret Bound in Deterministic Domain with Bernstein Inequality G.4 Reward Estimation and Optimism G.5 Transition Dynamics Estimation G.6 Transition Dynamics Optimism G.7 Lower Order Term G.8 Auxiliary Lemmas A Short Proofs Missing from the Main Text A.1 Euler on Chain Chain MDPs are commonly given as examples of challenging exploration domains because simple strategies like ǫ-greedy can take an exponential time to learn. We now discuss an intriguing result for the chain shown in figure 1 which is nearly identical to a previously introduced one [OR17]. Like in that domain, each episode is N = H = S timesteps long. The optimal policy is to go right which yields a reward of 1 for the episode. The transition probabilities are assigned in way that scales the optimal value function so that it is of order 1. Since the rewards are deterministic we immediately obtain up to a constant that does not depend on N: Q maxvarqs,a s,a max Var Vt+1s π + s,a 1 s,a,t s,a,t N 1 1 N 1 N since Vt+1s π + is essentially dominated by a Bernoulli random variable with success parameter 1 1/N times an appropriate scaling factor of order one. Therefore Euler s regret 4 is dominated by a term which is Õ Q SAT = Õ 1 T N N 2 T = Õ. Notice that the lower order term should be added to the above expression and this is likely to be significant particularly for small T. However, we remark that the result above follows directly from Theorem 1 whose proof does not attempt to make the lower order term problem-dependent. Our result is substantially smaller than the typically reported bounds for this case, which are dominated by a Õ HSAT term. There are two main factors that lead to the above simplification for this class of MDPs: V π is of order 1 and not N = H like in hard-to-learn MDPs which yields the lower bounds [DB15] and also the variance decreases as we let N increase, each of which removes a factor of N from the known worst-case bound Õ HSAT = Õ N 2 AT after substituting N = S = H. To enable Euler s level of performance on this and other MDPs one needs to carefully control both the confidence intervals of the rewards and of the transition probabilities as Euler does, since Hoeffding-like concentration inequalities for the rewards alone would already induce a Θ SAT = Θ NT contribution to the regret expression. This result is intriguing because it suggests that truly pathological MDP classes which induce Ω HSAT regret are even more uncommon than previously thought. 4 This is valid for any class of MDP which shares these properties, implying that the regret for the MDP in figure 1 can be even smaller 21
13 A.2 Euler on Tabular Contextual Bandits Contextual multi-armed bandits are a generalization of the multiarmed bandit problem, see for example [BCB12] for a survey. In their simplest possible formulation they entail a discrete set of contexts or states {s = 1,2,...} and actions {a = 1,2,...} and the expected reward rs,a depends on both the state and action. After playing an action, the agent transitions to the next states according to some fixed distribution µ R S over which the agent has no control. In principle, such problem can be recast as an MDP in which the next state is independent of the prior state and action. Consider an H-horizon MDP which maps to a tabular contextual bandit problem: the transition probability is identical ps s,a = µs for all states and actions, where µ is a fixed distribution from which the next states are sampled. In such MDP Define the best and worst context at time t, respectively: def s t = argmaxvt def s and s t = argminvt s and recall that the transition dynamics p s,a = µ depends nor on the action a nor on the current state s. We have that: { V t s t = max a rst,a+µ Vt+1 Vt s t = max a rst,a+µ Vt+1 22 which immediately yields a bound on the range of the value function of the successor states: rngv π t+1 = V t s t V t s t = max a rs t,a maxrs a t,a 1 23 where the last inequality follows from the fact that rewards are bounded r, [0,1]. Therefore Φ 1 and corollary 1.2 yields a high probability regret upper bound of order Õ SAT + SSAH 2 S +H 24 for Euler. This means that Euler automatically attains the lower bound in the dominant term for tabular contextual bandits [BCB12] if deployed in such setting. We did not try to improve the lower order terms for this specific setting, which may give an improved bound. B Appendix Overview and Proof Preview We start by giving an overview of the proof that leads to the main result. The setting considered in the appendix is more general than the one presented in the main text. In particular we 1 define a class of concentration inequalities for the transition dynamics 2 show that Euler achieves strong problem dependent regret bounds with any concentration inequality satisfying these assumptions. In particular, the main result presented in the main text follows as a corollary of the potentially more general analysis presented in the appendix. We now give a preview of the proof of the main result, assuming rewards are known. We start be recalling Euler with yet-to-specify confidence intervals on the transition dynamcs. B.1 Algorithm The algorithm is presented in figure 2. B.2 Optimism The goal of this section is to show that Euler guarantees optimism with high probability. As is well known, optimism is the driver of exploration as it allows to overestimate the regret by a concentration term with high probability. Let s consider the planning process at the beginning of episode k. This is detailed in lines 4 to 16 of algorithm 1. In order to guarantee finding a pointwise optimistic value function tk V t π a bonus is added to account for bad luck in the system dynamics experienced up to episode k. If the agent knew the value of the confidence intervalfor the system dynamicsφp s,a,vt+1 π then optimism could be inductively guaranteedi.e., assuming that, by induction, V t+1 π holds pointwise if said confidence interval holds: tk = rs,a+ ˆp k s,a π +φp s,a,vt+1 25
14 Algorithm 2 Euler for Stationary Episodic MDPs 1: Input: confidence interval b r k, φ, with failure probability δ, constants B p,b v. 2: ns,a = r sum s,a = p sum s,s,a = 0, s,a S A; V H+1 s = 0, s S 3: for k = 1,2,... do 4: for t = H,H 1,...,1 do 5: for s S do 6: for a A do 7: ˆp = psum,s,a ns,a 8: b pv k = φˆps,a,v 1 4J+B t+1+ p +B ns,a nk v V t+1 V s,a t+1 2,ˆp 9: Qa = min{h t,ˆr k s, s,t+b r k s,a+ ˆp V t+1 +b pv k } 10: end for 11: s,t = argmax a Qa 12: V t s = Q s,t 13: b pv k = φˆps, π 1 4J+B ks,t,v t+1 p ns, πk s,t nk v V t+1 V s, s,t t+1 2,ˆp 14: V t s = max{0,ˆr k s, s,t b r k s,s,t+ ˆp V t+1 b pv k } 15: end for 16: end for 17: s 1 p 0 18: for t = 1,...H do 19: a t = s t,t; r t p R s t,a t ; s t+1 p P s t,a t 20: ns t,a t ++; p sum s t+1,s t,a t ++ 21: end for 22: end for rs,a+ ˆp k s,a Vt+1 π π +φp s,a,vt+1 rs,a+p s,a Vt+1 π V t+1 π 26 If the above conclusion is true for every action then it is true for the maximizer as well. Unfortunately the agent knows nor the real transition dynamics nor the optimal value function to evaluate φ. Instead, it only has access to the estimated transition dynamics ˆp k s,a and to an overestimate of the value function. Unfortunately the confidence interval φ evaluated with such quantities φˆp k s,a, is not guaranteed to overestimate φp s,a,vt+1 π and optimism may not be guaranteed. To remedy this the agent can try to estimate the difference φˆp k s,a, π φp s,a,vt+1 27 and add a correction term to account for that difference. A similar problem is faced in [AOM17] where the authors propose an optimistic bonus which guarantees optimism when using the empirical Bernstein Inequality. By distinction, our way of constructing the bonus works with any concentration inequality satisfying assumption 1 and 2. Precisely, optimism is dealt with in section D in the appendix; in lemma 4 we show how to bound equation 27 obtaining the result below: φˆp k s,a,v φp s,a,vt+1 π B v V V π nk s,a t+1 2,ˆp + B p +4J n k s,a. 28 This is essentially a consequence of the definition of admissible bonus, i.e., def 3. Unfortunately the upper bound in equation 28 depends on Vt+1 π which is not known, so the problem is still unsolved. However, as we show in lemma 5 in the appendix it suffices to pointwise overestimate Vt+1. π To this aim, the algorithm maintains an underestimate of Vt+1 π which we call. Equipped with this underestimate, we define the Exploration Bonus in definition 5 in the appendix, which we report below: b pv k ˆp k s,a,, def = φˆp k s,a, + 4J +B p n k s,a + B v V 2,ˆp nk s,a. 29
Reinforcement Learning
Reinforcement Learning Lecture 3: RL problems, sample complexity and regret Alexandre Proutiere, Sadegh Talebi, Jungseul Ok KTH, The Royal Institute of Technology Objectives of this lecture Introduce the
More informationReinforcement Learning
Reinforcement Learning Lecture 6: RL algorithms 2.0 Alexandre Proutiere, Sadegh Talebi, Jungseul Ok KTH, The Royal Institute of Technology Objectives of this lecture Present and analyse two online algorithms
More informationMDP Preliminaries. Nan Jiang. February 10, 2019
MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process
More informationImproving PAC Exploration Using the Median Of Means
Improving PAC Exploration Using the Median Of Means Jason Pazis Laboratory for Information and Decision Systems Massachusetts Institute of Technology Cambridge, MA 039, USA jpazis@mit.edu Ronald Parr Department
More informationExploration. 2015/10/12 John Schulman
Exploration 2015/10/12 John Schulman What is the exploration problem? Given a long-lived agent (or long-running learning algorithm), how to balance exploration and exploitation to maximize long-term rewards
More informationReinforcement Learning
Reinforcement Learning Model-Based Reinforcement Learning Model-based, PAC-MDP, sample complexity, exploration/exploitation, RMAX, E3, Bayes-optimal, Bayesian RL, model learning Vien Ngo MLR, University
More informationAn Analysis of Model-Based Interval Estimation for Markov Decision Processes
An Analysis of Model-Based Interval Estimation for Markov Decision Processes Alexander L. Strehl, Michael L. Littman astrehl@gmail.com, mlittman@cs.rutgers.edu Computer Science Dept. Rutgers University
More informationLogarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning Peter Auer Ronald Ortner University of Leoben, Franz-Josef-Strasse 18, 8700 Leoben, Austria auer,rortner}@unileoben.ac.at Abstract
More informationSelecting the State-Representation in Reinforcement Learning
Selecting the State-Representation in Reinforcement Learning Odalric-Ambrym Maillard INRIA Lille - Nord Europe odalricambrym.maillard@gmail.com Rémi Munos INRIA Lille - Nord Europe remi.munos@inria.fr
More informationA Theoretical Analysis of Model-Based Interval Estimation
Alexander L. Strehl Michael L. Littman Department of Computer Science, Rutgers University, Piscataway, NJ USA strehl@cs.rutgers.edu mlittman@cs.rutgers.edu Abstract Several algorithms for learning near-optimal
More informationJournal of Computer and System Sciences. An analysis of model-based Interval Estimation for Markov Decision Processes
Journal of Computer and System Sciences 74 (2008) 1309 1331 Contents lists available at ScienceDirect Journal of Computer and System Sciences www.elsevier.com/locate/jcss An analysis of model-based Interval
More informationExperts in a Markov Decision Process
University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 2004 Experts in a Markov Decision Process Eyal Even-Dar Sham Kakade University of Pennsylvania Yishay Mansour Follow
More informationChristopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015
Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)
More informationPAC Model-Free Reinforcement Learning
Alexander L. Strehl strehl@cs.rutgers.edu Lihong Li lihong@cs.rutgers.edu Department of Computer Science, Rutgers University, Piscataway, NJ 08854 USA Eric Wiewiora ewiewior@cs.ucsd.edu Computer Science
More informationOptimism in the Face of Uncertainty Should be Refutable
Optimism in the Face of Uncertainty Should be Refutable Ronald ORTNER Montanuniversität Leoben Department Mathematik und Informationstechnolgie Franz-Josef-Strasse 18, 8700 Leoben, Austria, Phone number:
More informationAdministration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon.
Administration CSCI567 Machine Learning Fall 2018 Prof. Haipeng Luo U of Southern California Nov 7, 2018 HW5 is available, due on 11/18. Practice final will also be available soon. Remaining weeks: 11/14,
More informationPrioritized Sweeping Converges to the Optimal Value Function
Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science
More informationPracticable Robust Markov Decision Processes
Practicable Robust Markov Decision Processes Huan Xu Department of Mechanical Engineering National University of Singapore Joint work with Shiau-Hong Lim (IBM), Shie Mannor (Techion), Ofir Mebel (Apple)
More informationLearning Exploration/Exploitation Strategies for Single Trajectory Reinforcement Learning
JMLR: Workshop and Conference Proceedings vol:1 8, 2012 10th European Workshop on Reinforcement Learning Learning Exploration/Exploitation Strategies for Single Trajectory Reinforcement Learning Michael
More informationLecture 4: Lower Bounds (ending); Thompson Sampling
CMSC 858G: Bandits, Experts and Games 09/12/16 Lecture 4: Lower Bounds (ending); Thompson Sampling Instructor: Alex Slivkins Scribed by: Guowei Sun,Cheng Jie 1 Lower bounds on regret (ending) Recap from
More informationCsaba Szepesvári 1. University of Alberta. Machine Learning Summer School, Ile de Re, France, 2008
LEARNING THEORY OF OPTIMAL DECISION MAKING PART I: ON-LINE LEARNING IN STOCHASTIC ENVIRONMENTS Csaba Szepesvári 1 1 Department of Computing Science University of Alberta Machine Learning Summer School,
More informationOnline regret in reinforcement learning
University of Leoben, Austria Tübingen, 31 July 2007 Undiscounted online regret I am interested in the difference (in rewards during learning) between an optimal policy and a reinforcement learner: T T
More informationBias-Variance Error Bounds for Temporal Difference Updates
Bias-Variance Bounds for Temporal Difference Updates Michael Kearns AT&T Labs mkearns@research.att.com Satinder Singh AT&T Labs baveja@research.att.com Abstract We give the first rigorous upper bounds
More informationEfficient Exploration and Value Function Generalization in Deterministic Systems
Efficient Exploration and Value Function Generalization in Deterministic Systems Zheng Wen Stanford University zhengwen@stanford.edu Benjamin Van Roy Stanford University bvr@stanford.edu Abstract We consider
More informationOnline Regret Bounds for Markov Decision Processes with Deterministic Transitions
Online Regret Bounds for Markov Decision Processes with Deterministic Transitions Ronald Ortner Department Mathematik und Informationstechnologie, Montanuniversität Leoben, A-8700 Leoben, Austria Abstract
More informationQ-Learning in Continuous State Action Spaces
Q-Learning in Continuous State Action Spaces Alex Irpan alexirpan@berkeley.edu December 5, 2015 Contents 1 Introduction 1 2 Background 1 3 Q-Learning 2 4 Q-Learning In Continuous Spaces 4 5 Experimental
More informationMS&E338 Reinforcement Learning Lecture 1 - April 2, Introduction
MS&E338 Reinforcement Learning Lecture 1 - April 2, 2018 Introduction Lecturer: Ben Van Roy Scribe: Gabriel Maher 1 Reinforcement Learning Introduction In reinforcement learning (RL) we consider an agent
More information(More) Efficient Reinforcement Learning via Posterior Sampling
(More) Efficient Reinforcement Learning via Posterior Sampling Osband, Ian Stanford University Stanford, CA 94305 iosband@stanford.edu Van Roy, Benjamin Stanford University Stanford, CA 94305 bvr@stanford.edu
More informationMultiple Identifications in Multi-Armed Bandits
Multiple Identifications in Multi-Armed Bandits arxiv:05.38v [cs.lg] 4 May 0 Sébastien Bubeck Department of Operations Research and Financial Engineering, Princeton University sbubeck@princeton.edu Tengyao
More informationLecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation
Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation CS234: RL Emma Brunskill Winter 2018 Material builds on structure from David SIlver s Lecture 4: Model-Free
More informationNotes on Tabular Methods
Notes on Tabular ethods Nan Jiang September 28, 208 Overview of the methods. Tabular certainty-equivalence Certainty-equivalence is a model-based RL algorithm, that is, it first estimates an DP model from
More informationarxiv: v1 [cs.lg] 12 Feb 2018
in Reinforcement Learning Ronan Fruit * 1 Matteo Pirotta * 1 Alessandro Lazaric 2 Ronald Ortner 3 arxiv:1802.04020v1 [cs.lg] 12 Feb 2018 Abstract We introduce SCAL, an algorithm designed to perform efficient
More informationRegret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I. Sébastien Bubeck Theory Group
Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I Sébastien Bubeck Theory Group i.i.d. multi-armed bandit, Robbins [1952] i.i.d. multi-armed bandit, Robbins [1952] Known
More informationReinforcement Learning. Yishay Mansour Tel-Aviv University
Reinforcement Learning Yishay Mansour Tel-Aviv University 1 Reinforcement Learning: Course Information Classes: Wednesday Lecture 10-13 Yishay Mansour Recitations:14-15/15-16 Eliya Nachmani Adam Polyak
More informationFood delivered. Food obtained S 3
Press lever Enter magazine * S 0 Initial state S 1 Food delivered * S 2 No reward S 2 No reward S 3 Food obtained Supplementary Figure 1 Value propagation in tree search, after 50 steps of learning the
More informationReinforcement Learning
Reinforcement Learning March May, 2013 Schedule Update Introduction 03/13/2015 (10:15-12:15) Sala conferenze MDPs 03/18/2015 (10:15-12:15) Sala conferenze Solving MDPs 03/20/2015 (10:15-12:15) Aula Alpha
More informationARTIFICIAL INTELLIGENCE. Reinforcement learning
INFOB2KI 2018-2019 Utrecht University The Netherlands ARTIFICIAL INTELLIGENCE Reinforcement learning Lecturer: Silja Renooij These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html
More informationCMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro
CMU 15-781 Lecture 12: Reinforcement Learning Teacher: Gianni A. Di Caro REINFORCEMENT LEARNING Transition Model? State Action Reward model? Agent Goal: Maximize expected sum of future rewards 2 MDP PLANNING
More informationCS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability
CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability Due: Thursday 10/15 in 283 Soda Drop Box by 11:59pm (no slip days) Policy: Can be solved in groups (acknowledge collaborators)
More informationLower PAC bound on Upper Confidence Bound-based Q-learning with examples
Lower PAC bound on Upper Confidence Bound-based Q-learning with examples Jia-Shen Boon, Xiaomin Zhang University of Wisconsin-Madison {boon,xiaominz}@cs.wisc.edu Abstract Abstract Recently, there has been
More informationIntroduction to Reinforcement Learning
CSCI-699: Advanced Topics in Deep Learning 01/16/2019 Nitin Kamra Spring 2019 Introduction to Reinforcement Learning 1 What is Reinforcement Learning? So far we have seen unsupervised and supervised learning.
More information1 MDP Value Iteration Algorithm
CS 0. - Active Learning Problem Set Handed out: 4 Jan 009 Due: 9 Jan 009 MDP Value Iteration Algorithm. Implement the value iteration algorithm given in the lecture. That is, solve Bellman s equation using
More informationLecture 1: March 7, 2018
Reinforcement Learning Spring Semester, 2017/8 Lecture 1: March 7, 2018 Lecturer: Yishay Mansour Scribe: ym DISCLAIMER: Based on Learning and Planning in Dynamical Systems by Shie Mannor c, all rights
More informationarxiv: v2 [cs.lg] 1 Dec 2016
Contextual Decision Processes with Low Bellman Rank are PAC-Learnable Nan Jiang Akshay Krishnamurthy Alekh Agarwal nanjiang@umich.edu akshay@cs.umass.edu alekha@microsoft.com John Langford Robert E. Schapire
More informationLecture 3: Lower Bounds for Bandit Algorithms
CMSC 858G: Bandits, Experts and Games 09/19/16 Lecture 3: Lower Bounds for Bandit Algorithms Instructor: Alex Slivkins Scribed by: Soham De & Karthik A Sankararaman 1 Lower Bounds In this lecture (and
More informationarxiv: v2 [stat.ml] 17 Jul 2013
Regret Bounds for Reinforcement Learning with Policy Advice Mohammad Gheshlaghi Azar 1 and Alessandro Lazaric 2 and Emma Brunskill 1 1 Carnegie Mellon University, Pittsburgh, PA, USA {ebrun,mazar}@cs.cmu.edu
More informationCOS 402 Machine Learning and Artificial Intelligence Fall Lecture 22. Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3
COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 22 Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3 How to balance exploration and exploitation in reinforcement
More informationMarks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam:
Marks } Assignment 1: Should be out this weekend } All are marked, I m trying to tally them and perhaps add bonus points } Mid-term: Before the last lecture } Mid-term deferred exam: } This Saturday, 9am-10.30am,
More informationGrundlagen der Künstlichen Intelligenz
Grundlagen der Künstlichen Intelligenz Reinforcement learning Daniel Hennes 4.12.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Reinforcement learning Model based and
More informationToday s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning
CSE 473: Artificial Intelligence Reinforcement Learning Dan Weld Today s Outline Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Many slides
More informationAction Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems
Journal of Machine Learning Research (2006 Submitted 2/05; Published 1 Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems Eyal Even-Dar evendar@seas.upenn.edu
More informationValue Directed Exploration in Multi-Armed Bandits with Structured Priors
Value Directed Exploration in Multi-Armed Bandits with Structured Priors Bence Cserna Marek Petrik Reazul Hasan Russel Wheeler Ruml University of New Hampshire Durham, NH 03824 USA bence, mpetrik, rrussel,
More informationPreference Elicitation for Sequential Decision Problems
Preference Elicitation for Sequential Decision Problems Kevin Regan University of Toronto Introduction 2 Motivation Focus: Computational approaches to sequential decision making under uncertainty These
More informationOpen Theoretical Questions in Reinforcement Learning
Open Theoretical Questions in Reinforcement Learning Richard S. Sutton AT&T Labs, Florham Park, NJ 07932, USA, sutton@research.att.com, www.cs.umass.edu/~rich Reinforcement learning (RL) concerns the problem
More informationAdvanced Machine Learning
Advanced Machine Learning Bandit Problems MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Multi-Armed Bandit Problem Problem: which arm of a K-slot machine should a gambler pull to maximize his
More informationReinforcement Learning
Reinforcement Learning Markov decision process & Dynamic programming Evaluative feedback, value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value
More informationBalancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm
Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 2778 mgl@cs.duke.edu
More informationLecture 5: Regret Bounds for Thompson Sampling
CMSC 858G: Bandits, Experts and Games 09/2/6 Lecture 5: Regret Bounds for Thompson Sampling Instructor: Alex Slivkins Scribed by: Yancy Liao Regret Bounds for Thompson Sampling For each round t, we defined
More informationCS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability
CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability Due: Thursday 10/15 in 283 Soda Drop Box by 11:59pm (no slip days) Policy: Can be solved in groups (acknowledge collaborators)
More informationDistributed Optimization. Song Chong EE, KAIST
Distributed Optimization Song Chong EE, KAIST songchong@kaist.edu Dynamic Programming for Path Planning A path-planning problem consists of a weighted directed graph with a set of n nodes N, directed links
More informationBayesian Reinforcement Learning with Exploration
Bayesian Reinforcement Learning with Exploration Tor Lattimore 1 and Marcus Hutter 1 University of Alberta tor.lattimore@gmail.com Australian National University marcus.hutter@anu.edu.au Abstract. We consider
More informationReal Time Value Iteration and the State-Action Value Function
MS&E338 Reinforcement Learning Lecture 3-4/9/18 Real Time Value Iteration and the State-Action Value Function Lecturer: Ben Van Roy Scribe: Apoorva Sharma and Tong Mu 1 Review Last time we left off discussing
More informationSelecting Near-Optimal Approximate State Representations in Reinforcement Learning
Selecting Near-Optimal Approximate State Representations in Reinforcement Learning Ronald Ortner 1, Odalric-Ambrym Maillard 2, and Daniil Ryabko 3 1 Montanuniversitaet Leoben, Austria 2 The Technion, Israel
More informationLearning to Coordinate Efficiently: A Model-based Approach
Journal of Artificial Intelligence Research 19 (2003) 11-23 Submitted 10/02; published 7/03 Learning to Coordinate Efficiently: A Model-based Approach Ronen I. Brafman Computer Science Department Ben-Gurion
More informationA General Overview of Parametric Estimation and Inference Techniques.
A General Overview of Parametric Estimation and Inference Techniques. Moulinath Banerjee University of Michigan September 11, 2012 The object of statistical inference is to glean information about an underlying
More informationPAC Subset Selection in Stochastic Multi-armed Bandits
In Langford, Pineau, editors, Proceedings of the 9th International Conference on Machine Learning, pp 655--66, Omnipress, New York, NY, USA, 0 PAC Subset Selection in Stochastic Multi-armed Bandits Shivaram
More informationAnytime optimal algorithms in stochastic multi-armed bandits
Rémy Degenne LPMA, Université Paris Diderot Vianney Perchet CREST, ENSAE REMYDEGENNE@MATHUNIV-PARIS-DIDEROTFR VIANNEYPERCHET@NORMALESUPORG Abstract We introduce an anytime algorithm for stochastic multi-armed
More informationReinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina
Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it
More informationCOMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning. Hanna Kurniawati
COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning Hanna Kurniawati Today } What is machine learning? } Where is it used? } Types of machine learning
More informationLecture 3: Markov Decision Processes
Lecture 3: Markov Decision Processes Joseph Modayil 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs Markov Processes Introduction Introduction to MDPs Markov
More informationA Gentle Introduction to Reinforcement Learning
A Gentle Introduction to Reinforcement Learning Alexander Jung 2018 1 Introduction and Motivation Consider the cleaning robot Rumba which has to clean the office room B329. In order to keep things simple,
More informationSample Complexity of Multi-task Reinforcement Learning
Sample Complexity of Multi-task Reinforcement Learning Emma Brunskill Computer Science Department Carnegie Mellon University Pittsburgh, PA 1513 Lihong Li Microsoft Research One Microsoft Way Redmond,
More informationThe Multi-Armed Bandit Problem
The Multi-Armed Bandit Problem Electrical and Computer Engineering December 7, 2013 Outline 1 2 Mathematical 3 Algorithm Upper Confidence Bound Algorithm A/B Testing Exploration vs. Exploitation Scientist
More informationSparse Linear Contextual Bandits via Relevance Vector Machines
Sparse Linear Contextual Bandits via Relevance Vector Machines Davis Gilton and Rebecca Willett Electrical and Computer Engineering University of Wisconsin-Madison Madison, WI 53706 Email: gilton@wisc.edu,
More informationNear-Optimal Time and Sample Complexities for Solving Markov Decision Processes with a Generative Model
Near-Optimal Time and Sample Complexities for Solving Markov Decision Processes with a Generative Model Aaron Sidford Stanford University sidford@stanford.edu Mengdi Wang Princeton University mengdiw@princeton.edu
More informationInternet Monetization
Internet Monetization March May, 2013 Discrete time Finite A decision process (MDP) is reward process with decisions. It models an environment in which all states are and time is divided into stages. Definition
More informationBasics of reinforcement learning
Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system
More informationReinforcement Learning II
Reinforcement Learning II Andrea Bonarini Artificial Intelligence and Robotics Lab Department of Electronics and Information Politecnico di Milano E-mail: bonarini@elet.polimi.it URL:http://www.dei.polimi.it/people/bonarini
More informationPolicy Gradient Reinforcement Learning Without Regret
Policy Gradient Reinforcement Learning Without Regret by Travis Dick A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science Department of Computing Science University
More informationSequential Decision Problems
Sequential Decision Problems Michael A. Goodrich November 10, 2006 If I make changes to these notes after they are posted and if these changes are important (beyond cosmetic), the changes will highlighted
More informationQ-Learning for Markov Decision Processes*
McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan khoa.phan@mail.mcgill.ca Sandeep Manjanna sandeep.manjanna@mail.mcgill.ca (*Based on: Convergence of
More informationMARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti
1 MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti Historical background 2 Original motivation: animal learning Early
More informationA Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time
A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time April 16, 2016 Abstract In this exposition we study the E 3 algorithm proposed by Kearns and Singh for reinforcement
More informationBiasing Approximate Dynamic Programming with a Lower Discount Factor
Biasing Approximate Dynamic Programming with a Lower Discount Factor Marek Petrik, Bruno Scherrer To cite this version: Marek Petrik, Bruno Scherrer. Biasing Approximate Dynamic Programming with a Lower
More informationLearning in Zero-Sum Team Markov Games using Factored Value Functions
Learning in Zero-Sum Team Markov Games using Factored Value Functions Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 27708 mgl@cs.duke.edu Ronald Parr Department of Computer
More informationSpeedy Q-Learning. Abstract
Speedy Q-Learning Mohammad Gheshlaghi Azar Radboud University Nijmegen Geert Grooteplein N, 6 EZ Nijmegen, Netherlands m.azar@science.ru.nl Mohammad Ghavamzadeh INRIA Lille, SequeL Project 40 avenue Halley
More informationBellmanian Bandit Network
Bellmanian Bandit Network Antoine Bureau TAO, LRI - INRIA Univ. Paris-Sud bldg 50, Rue Noetzlin, 91190 Gif-sur-Yvette, France antoine.bureau@lri.fr Michèle Sebag TAO, LRI - CNRS Univ. Paris-Sud bldg 50,
More informationINF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018
Machine learning for image classification Lecture 14: Reinforcement learning May 9, 2018 Page 3 Outline Motivation Introduction to reinforcement learning (RL) Value function based methods (Q-learning)
More informationComplexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning
Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning Christos Dimitrakakis Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands
More informationTesting Problems with Sub-Learning Sample Complexity
Testing Problems with Sub-Learning Sample Complexity Michael Kearns AT&T Labs Research 180 Park Avenue Florham Park, NJ, 07932 mkearns@researchattcom Dana Ron Laboratory for Computer Science, MIT 545 Technology
More informationReinforcement learning
Reinforcement learning Based on [Kaelbling et al., 1996, Bertsekas, 2000] Bert Kappen Reinforcement learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error
More informationWhy is Posterior Sampling Better than Optimism for Reinforcement Learning?
Ian Osband 12 Benjamin Van Roy 1 Abstract Computational results demonstrate that posterior sampling for reinforcement learning (PSRL) dramatically outperforms existing algorithms driven by optimism, such
More informationReinforcement Learning and NLP
1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value
More informationOnline Learning Schemes for Power Allocation in Energy Harvesting Communications
Online Learning Schemes for Power Allocation in Energy Harvesting Communications Pranav Sakulkar and Bhaskar Krishnamachari Ming Hsieh Department of Electrical Engineering Viterbi School of Engineering
More informationThe Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount
The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount Yinyu Ye Department of Management Science and Engineering and Institute of Computational
More informationPure Exploration Stochastic Multi-armed Bandits
C&A Workshop 2016, Hangzhou Pure Exploration Stochastic Multi-armed Bandits Jian Li Institute for Interdisciplinary Information Sciences Tsinghua University Outline Introduction 2 Arms Best Arm Identification
More informationCS599 Lecture 1 Introduction To RL
CS599 Lecture 1 Introduction To RL Reinforcement Learning Introduction Learning from rewards Policies Value Functions Rewards Models of the Environment Exploitation vs. Exploration Dynamic Programming
More informationReinforcement Learning
Reinforcement Learning Ron Parr CompSci 7 Department of Computer Science Duke University With thanks to Kris Hauser for some content RL Highlights Everybody likes to learn from experience Use ML techniques
More informationWhy is Posterior Sampling Better than Optimism for Reinforcement Learning?
Journal of Machine Learning Research () Submitted ; Published Why is Posterior Sampling Better than Optimism for Reinforcement Learning? Ian Osband Stanford University, Google DeepMind Benjamin Van Roy
More informationIntroduction to Reinforcement Learning. CMPT 882 Mar. 18
Introduction to Reinforcement Learning CMPT 882 Mar. 18 Outline for the week Basic ideas in RL Value functions and value iteration Policy evaluation and policy improvement Model-free RL Monte-Carlo and
More information