arxiv: v1 [cs.lg] 1 Jan 2019


Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds

Andrea Zanette
Institute for Computational and Mathematical Engineering, Stanford University

Emma Brunskill
Department of Computer Science, Stanford University

Abstract

Strong worst-case performance bounds for episodic reinforcement learning exist, but fortunately in practice RL algorithms perform much better than such bounds would predict. Algorithms and theory that provide strong problem-dependent bounds could help illuminate the key features of what makes an RL problem hard and reduce the barrier to using RL algorithms in practice. As a step towards this we derive an algorithm for finite horizon discrete MDPs and an associated analysis that both yields state-of-the-art worst-case regret bounds in the dominant terms and yields substantially tighter bounds if the RL environment has a small environmental norm, which is a function of the variance of the next-state value functions. An important benefit of our algorithm is that it does not require a priori knowledge of a bound on the environmental norm. As a result of our analysis, we also help address an open learning theory question [JA18] about episodic MDPs with a constant upper bound on the sum of rewards, providing a regret bound with no $H$-dependence in the leading term that scales as a polynomial function of the number of episodes.

1 Introduction

In reinforcement learning (RL) an agent must learn how to make good decisions without having access to an exact model of the world. Most of the literature on provably efficient exploration in Markov decision processes (MDPs) [JOA10, OVR13, LH14, DB15, DLB17, OR17, AOM17, KWY18] has focused on providing near-optimal worst-case performance bounds: such bounds are highly desirable as they do not depend on the structure of the particular environment considered and therefore hold even for extremely hard-to-learn MDPs.
Fortunately, in practice reinforcement learning algorithms often perform far better than these problem-independent bounds would suggest. While we may observe better or worse performance empirically on different MDPs, we would like to derive a more systematic understanding of what types of decision processes are inherently easier or more challenging for RL. This motivates our interest in deriving algorithms and theoretical analyses that provide problem-dependent bounds. Ideally, such algorithms will do as well as RL solutions designed for the worst case if the problem is pathologically difficult and otherwise match the performance bounds of algorithms specifically designed for a particular problem subclass. This exciting scenario might bring considerable savings in the time spent designing domain-specific RL solutions and in training a human expert to judge and recognize the complexity of different problems. An added benefit would be the robustness of the RL solution in case the actual model does not belong to the identified subclass, yielding increased confidence in deploying RL in high-stakes applications.

Towards this goal, in this paper we contribute a new algorithm for episodic tabular reinforcement learning which automatically provides provably stronger regret bounds in many domains which have a smaller environmental norm [MMM14], which is a characterization of the variance of the optimal value function. Indeed, there is good reason to believe that some features of the range or variability of the optimal value function should be a critical aspect of the hardness of reinforcement learning in an MDP. Many worst-case bounds for

finite-state MDPs scale with a worst-case bound on the range or magnitude of the value function. For example, Ucrl2 from [JOA10] and a recently proposed algorithm from [AOM17] enjoy regret bounds¹ of

$$\tilde{O}\big(DS\sqrt{AT}\big) \ \text{(undiscounted [JOA10])}, \qquad \tilde{O}\big(\sqrt{HSAT}\big) \ \text{(fixed horizon [AOM17])} \tag{1}$$

where $D$ is the diameter in the infinite-horizon setting and $H$ is the horizon in an episodic problem, respectively. Note that here both $D$ and $H$ arise in the analyses as upper bounds on the range of the optimistic value function across the entire MDP. In fact, many RL algorithms with strong performance bounds rely on the principle of optimism under uncertainty and compute an optimistic value function. As more samples are collected, one would hope that the agent's optimistic value function converges to the true optimal value function. Unfortunately this is not the case; see for example [JOA10, BT09, ZB18] for a discussion. As a result, most prior analyses bounded the optimistic value function by generic quantities like $D$ or $H$ regardless of the actual behaviour of the optimal value function.

While the majority of formal performance guarantees have focused on bounds for the worst case, there have been several contributions of algorithms and/or theoretical analyses focused on MDPs with particular structure. One quantity of interest is the range of the optimal value function, and prior work [BT09] proposed an algorithm that can replace the diameter $D$ with an overestimate of the range of the optimal value function across all states. A tractable version of this algorithm was subsequently derived [FPLO18]. However, while these approaches provide strong regret bounds that highlight the better performance that can be obtained in domains with a small range in the optimal value function, they assume a priori domain knowledge of the range of the optimal value function to attain improved performance.
In contrast, [MMM14] introduced a slightly different quantity, the environmental norm, that depends only on the variance of the optimal value function across successor states. The authors demonstrate that this measure of MDP complexity empirically explains the complexity of doing RL in a number of common simulation benchmarks. However, the algorithm and analysis they derive do not enjoy an improved regret bound compared to the available worst-case analyses, nor do they provide performance bounds for specific domain subclasses, so their primary contribution is a good complexity measure that can help empirically explain the different hardness of various RL problems.

¹ Leading-order term only.

In our paper we build on this promising definition of domain complexity, the environmental norm, but go significantly further to show that we can derive an algorithm for finite horizon discrete MDPs and an associated analysis that yields state-of-the-art worst-case regret bounds of order $\tilde{O}(\sqrt{HSAT})$ in the leading term, all the while improving if the environment has a small environmental norm. Compared to the existing literature, our work

• Maintains state-of-the-art worst-case guarantees [AOM17, KWY18],
• Improves on the regret bounds of [BT09, FPLO18] when restricted to the episodic setting, and without the need for domain knowledge of a bound on the range of the value function,
• Improves the regret bounds of [ZB18] when deployed in the same settings,
• Provides the first demonstration that characterizing problems using the environmental norm [MMM14] can yield substantially tighter theoretical guarantees,
• Identifies significant problem classes with low environmental norm which are of significant interest, including deterministic domains, single-goal MDPs, and high stochasticity domains, and
• Helps address an open learning theory problem [JA18], showing that in their setting we obtain a regret bound in which the leading term, the one that is a polynomial function of the number of episodes, has no dependence on the planning horizon.

The paper is organized as follows: we recall some basic definitions in Section 2 and describe the algorithm in Section 3. We state and comment on the main result in Section 4, discuss how this helps address an open learning theory problem in Section 5 and then describe selected problem-dependent bounds in Section 6. The analysis is sketched in Section 7. Due to space constraints, most proofs are deferred to the appendix.

2 Preliminaries

In this section we introduce some notation and definitions commonly adopted in RL. We consider undiscounted finite horizon MDPs [SB98], which are defined by a tuple $M = \langle \mathcal{S}, \mathcal{A}, p, r, H \rangle$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces with cardinality $S$ and $A$, respectively. We denote by $p(s' \mid s,a)$ the probability of transitioning to state $s'$ after taking action $a$

in state $s$ while $r(s,a) \in [0,1]$ is the average instantaneous reward collected. We label with $n_k(s,a)$ the visits to the $(s,a)$ pair at the beginning of the $k$-th episode. The agent interacts with the MDP starting from arbitrary initial states in a sequence of episodes $k \in [K]$, where $[K] = \{j \in \mathbb{N} : 1 \le j \le K\}$, of fixed length $H$ by selecting a policy $\pi_k$ which maps states $s$ and timesteps $t$ to actions. Each policy identifies a value function for every state $s$ and timestep $t \in [H]$ defined as

$$V_t^{\pi_k}(s_t) = \mathbb{E}_{(s,a)\sim\pi_k} \sum_{i=t}^{H} r(s_i,a_i),$$

which is the expected return until the end of the episode (the conditional expectation is over the pairs $(s,a)$ encountered in the MDP upon starting from $s_t$). The optimal policy is indicated with $\pi^*$ and its value function as $V_t^{\pi^*}$. We indicate with $\underline{V}_t$ and $\overline{V}_t$, respectively, a pointwise underestimate and overestimate of the optimal value function, and with $\hat{p}_k(s,a)$ and $\hat{r}_k(s,a)$ the MLE estimates of $p(s,a)$ and $r(s,a)$. We focus on deriving a high probability upper bound on the

$$\mathrm{Regret}(K) \overset{def}{=} \sum_{k\in[K]} \Big( V_1^{\pi^*}(s_1^k) - V_1^{\pi_k}(s_1^k) \Big)$$

to measure the agent's learning performance. We use the $\tilde{O}(\cdot)$ notation to indicate a quantity that depends on $(\cdot)$ up to a polylog expression of a quantity at most polynomial in $S, A, T, K, H, \frac{1}{\delta}$. We also use the $\lesssim$, $\gtrsim$, $\simeq$ notation to mean $\le$, $\ge$, $=$, respectively, up to a numerical constant, and indicate with $\|X\|_{2,p}$ the 2-norm of a random variable² under $p$, i.e., $\|X\|_{2,p} \overset{def}{=} \sqrt{\mathbb{E}_p X^2} = \sqrt{\sum_{s} p(s) X^2(s)}$ if $p$ is its probability mass function.

3 Euler

We introduce the algorithm Episodic Upper Lower Exploration in Reinforcement learning (Euler), which adopts the paradigm of optimism under uncertainty to conduct exploration. Recent work [DB15, DLB17, AOM17, KWY18] has demonstrated how the choice of the exploration bonus is critical to enabling tighter problem-independent performance bounds. In particular, tighter confidence bounds on the expected future rewards, $(\hat{p}_k(s,a) - p(s,a))^\top V_{t+1}^{\pi^*}$, obtained by leveraging Bernstein's inequality, have led to significant improvements.
However, as the true transition model $p(s,a)$ and next-state values $V_{t+1}^{\pi^*}$ are unknown, existing algorithms must rely on the empirical estimates of these quantities to construct tight confidence intervals, but must do so in a way that preserves optimism, which is a critical property of most theoretical analyses. This has given rise to various methods that introduce additional correction terms to ensure optimism.

² To be precise, this is a norm between classes of random variables that are almost surely the same.

In our algorithm we use the empirical Bernstein inequality [MP09] for both the reward and the transition-dynamics deviation terms ($b^r_k(s,a)$ and $\phi(p,V)$ in the algorithm, respectively), with a different correction term. This allows Euler to preserve in the dominant term the best problem-independent bound $\tilde{O}(\sqrt{HSAT})$ for this setting while simultaneously attaining much stronger problem-dependent bounds than have been previously obtained. The key insight is that by maintaining point-wise consistent upper and lower bounds directly on $V^{\pi^*}$, we can define a bonus term that guarantees optimism, decays quickly, and is a function of the underlying problem. We provide pseudocode for Euler, which details the main procedure, in Algorithm 1. Notice that Euler has the same computational complexity as value iteration.

4 Main Result

Before stating the result, we recall the definition of environmental norm, which first appeared in [MMM14]. The definition advanced in [MMM14] is for infinite horizon RL but can very easily be adapted to fixed horizon RL while including the reward variances:

Definition 1 (Environmental Norm). The environmental norm $\mathbb{Q}^*$ of an MDP is the maximum conditional variance (conditioned on the $(s,a)$ pair) of the immediate reward random variable $R$ and the next-state optimal value function:

$$\mathbb{Q}^* \overset{def}{=} \max_{s,a,t} \Big( \mathrm{Var}\, R(s,a) + \mathrm{Var}_{s^+ \sim p(s,a)} V_{t+1}^{\pi^*}(s^+) \Big). \tag{2}$$

Intuitively, if $\mathbb{Q}^*$ is small then it is not difficult to estimate the $Q_t^{\pi^*}$ values from the available samples, provided that $V_{t+1}^{\pi^*}$ is known.
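To make Definition 1 concrete, the following is a minimal sketch of how the environmental norm could be computed when the model is fully known. It assumes a tabular representation that is not in the paper: `p[s][a]` is a list of next-state probabilities, `r_mean[s][a]` and `r_var[s][a]` are the mean and variance of the reward, and all function names are illustrative. The optimal values are obtained by plain backward induction.

```python
def variance(probs, values):
    """Variance of `values` under the probability mass function `probs`."""
    mean = sum(q * v for q, v in zip(probs, values))
    return sum(q * (v - mean) ** 2 for q, v in zip(probs, values))


def optimal_values(p, r_mean, horizon):
    """Backward induction: values[t][s] is V* at timestep t + 1, values[horizon] is 0."""
    n_states = len(p)
    values = [[0.0] * n_states for _ in range(horizon + 1)]
    for t in range(horizon - 1, -1, -1):
        for s in range(n_states):
            values[t][s] = max(
                r_mean[s][a] + sum(q * v for q, v in zip(p[s][a], values[t + 1]))
                for a in range(len(p[s]))
            )
    return values


def environmental_norm(p, r_mean, r_var, horizon):
    """Max over (s, a, t) of reward variance plus next-state variance of V*."""
    values = optimal_values(p, r_mean, horizon)
    return max(
        r_var[s][a] + variance(p[s][a], values[t + 1])
        for t in range(horizon)
        for s in range(len(p))
        for a in range(len(p[s]))
    )


# Hypothetical two-state, one-action MDP: state 1 is absorbing with reward 1.
r_mean = [[0.0], [1.0]]
r_var = [[0.0], [0.0]]
p_noisy = [[[0.5, 0.5]], [[0.0, 1.0]]]          # state 0 moves to 0 or 1 at random
p_deterministic = [[[0.0, 1.0]], [[0.0, 1.0]]]  # state 0 always moves to 1

q_noisy = environmental_norm(p_noisy, r_mean, r_var, horizon=2)
q_deterministic = environmental_norm(p_deterministic, r_mean, r_var, horizon=2)
```

In this toy example the only source of variance is the coin flip out of state 0, so the deterministic variant has zero environmental norm while the noisy one does not.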
4.1 Problem Dependent Regret Bound

Now we present our main result, which is a problem-dependent high-probability regret upper bound for Euler together with a worst-case guarantee that matches the established [OV16, JOA10] lower bound of $\Omega(\sqrt{HSAT})$ in the dominant term. Before this, we define the maximum return under any policy and any realization of the rewards and transitions within an episode as $\mathcal{G}$. In other words, $\mathcal{G}$ is a deterministic upper bound on the random total reward collected within an episode.

Theorem 1 (Problem Dependent High Probability Regret Upper Bound for Euler). With probability at least $1-\delta$ the regret of Euler is bounded for any time $T \le KH$ by the minimum between

$$O\Big(\mathrm{polylog}(S,A,T,1/\delta)\Big(\sqrt{\mathbb{Q}^* SAT} + \sqrt{S}\,SAH^2\big(\sqrt{S}+\sqrt{H}\big)\Big)\Big) \tag{3}$$

and

$$O\Big(\mathrm{polylog}(S,A,T,1/\delta)\Big(\sqrt{\tfrac{\mathcal{G}^2}{H} SAT} + \sqrt{S}\,SAH^2\big(\sqrt{S}+\sqrt{H}\big)\Big)\Big), \tag{4}$$

jointly for all episodes $k \in [K]$.

Since the rewards are in $[0,1]$, we immediately have that $\mathcal{G}^2 \le H^2$, and obtain the following corollary for the worst-case regret bound:

Corollary 1.1. With probability at least $1-\delta$ the regret of Euler is bounded for any time $T \le KH$ by

$$\tilde{O}\Big(\sqrt{HSAT} + \sqrt{S}\,SAH^2\big(\sqrt{S}+\sqrt{H}\big)\Big). \tag{5}$$

Note that this matches in the dominant term the minimax problem-independent regret bounds for tabular episodic RL settings [AOM17]. Therefore, the importance of our Theorem 1 lies in providing problem-dependent bounds (equations 3 and 4) while simultaneously matching the existing best worst-case guarantees (equation 5). In addition, we shall shortly see that our results also provide tighter results than prior work for a general alternate definition of episodic MDPs that was recently introduced [JA18].

The key quantity that encodes the MDP complexity is $\mathbb{Q}^*$, the environmental norm of [MMM14]. Notice that the algorithm does not need to be informed of the value of $\mathbb{Q}^*$. Prior empirical evaluation of $\mathbb{Q}^*$ in [MMM14] has shown that it takes small values in a number of common empirical benchmarks: see [MMM14] for a description and references. This empirical evaluation of the environmental norm is encouraging, because it directly implies that Euler can leverage the structure of a number of commonly encountered MDPs.

Maillard et al. [MMM14] also propose an algorithm that depends on this useful metric of hardness and show a regret bound whose leading order term is:

$$\tilde{O}\Big(\frac{1}{p_0}\, DS\sqrt{\mathbb{Q}^* AT}\Big). \tag{6}$$

Unfortunately, to our understanding their result does not improve over worst-case analyses for that setting, since the worst-case value function upper bound (which is the diameter $D$ in their setting and $H$ in ours) is still a factor that shows up in the leading order regret contribution. Said factor is instead replaced altogether by the environmental norm in our analysis. In addition, the inverse of the minimum non-zero transition probability $p_0 = \min_{s,s',a:\, p(s' \mid s,a) > 0} p(s' \mid s,a)$ appears in their regret bound. As $p_0$ can be arbitrarily small, the bound can become vacuous. Further, their algorithm must know $p_0$. For these reasons, the authors of [MMM14] recognize that their main contribution is mostly about identifying a useful norm (and its dual) that characterizes the hardness of a number of empirical problems, but the question of whether the environmental norm is a useful quantity from a theoretical point of view was essentially left as an open problem. Indeed our contribution can be viewed as a realization of that vision, i.e., proposing an algorithm and analysis which order-wise scale with the environmental norm.

Algorithm 1 Euler for Stationary Episodic MDPs

Input: failure probability $\delta$, $\delta' \overset{def}{=} \frac{1}{6}\delta$, $L \overset{def}{=} 2\ln(12SAT/\delta)$, $\overline{V}_{H+1} = \underline{V}_{H+1} = 0$
$\phi(p,V) \overset{def}{=} \sqrt{\dfrac{2\,\mathrm{Var}_p(V)\,\ln\frac{4SAT}{\delta'}}{n_k(s,a)}} + \dfrac{7H\ln\frac{4SAT}{\delta'}}{3\,n_k(s,a)}$, $\quad b^r_k(s,a) = \sqrt{\dfrac{2\,\widehat{\mathrm{Var}}R(s,a)\,\ln\frac{4SAT}{\delta'}}{n_k(s,a)}} + \dfrac{7\ln\frac{4SAT}{\delta'}}{3\,n_k(s,a)}$
for $k = 1, 2, \dots$ do
    for $t = H, H-1, \dots, 1$ do
        for $s \in \mathcal{S}$ do
            for $a \in \mathcal{A}$ do
                $b^{pv}_k = \phi(\hat{p}(s,a), \overline{V}_{t+1}) + \dfrac{1}{\sqrt{n_k(s,a)}}\Big(\dfrac{28HL}{3\sqrt{n_k(s,a)}} + L\,\|\overline{V}_{t+1} - \underline{V}_{t+1}\|_{2,\hat{p}(s,a)}\Big)$
                $Q(a) = \min\{H - t,\ \hat{r}_k(s,a) + b^r_k(s,a) + \hat{p}(s,a)^\top \overline{V}_{t+1} + b^{pv}_k\}$
            end for
            $\bar{\pi}_k(s,t) = \arg\max_a Q(a)$, $\quad \overline{V}_t(s) = Q(\bar{\pi}_k(s,t))$
            $b^{pv}_k = \phi(\hat{p}(s,\bar{\pi}_k(s,t)), \underline{V}_{t+1}) + \dfrac{1}{\sqrt{n_k(s,\bar{\pi}_k(s,t))}}\Big(\dfrac{28HL}{3\sqrt{n_k(s,\bar{\pi}_k(s,t))}} + L\,\|\overline{V}_{t+1} - \underline{V}_{t+1}\|_{2,\hat{p}(s,\bar{\pi}_k(s,t))}\Big)$
            $\underline{V}_t(s) = \max\{0,\ \hat{r}_k(s,\bar{\pi}_k(s,t)) - b^r_k(s,\bar{\pi}_k(s,t)) + \hat{p}(s,\bar{\pi}_k(s,t))^\top \underline{V}_{t+1} - b^{pv}_k\}$
        end for
    end for
    Evaluate policy $\bar{\pi}_k$ and update MLE estimates $\hat{p}$ and $\hat{r}$
end for
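The transition bonus used in the optimistic backup of Algorithm 1 can be sketched as below. This is only an illustration under one reading of the bonus terms: `log_term` stands for the logarithmic factor $\ln(4SAT/\delta')$ and `big_l` for the constant $L$, both passed in as plain numbers rather than derived, and the function names and example values are hypothetical.

```python
import math


def variance(probs, values):
    """Variance of `values` under the probability mass function `probs`."""
    mean = sum(q * v for q, v in zip(probs, values))
    return sum(q * (v - mean) ** 2 for q, v in zip(probs, values))


def weighted_l2_norm(probs, values):
    """||X||_{2,p} = sqrt(sum_s p(s) X(s)^2)."""
    return math.sqrt(sum(q * v * v for q, v in zip(probs, values)))


def bernstein_term(p_hat, values, n, horizon, log_term):
    """Empirical-Bernstein width phi(p_hat, V) for a value vector in [0, H]."""
    return (math.sqrt(2.0 * variance(p_hat, values) * log_term / n)
            + 7.0 * horizon * log_term / (3.0 * n))


def transition_bonus(p_hat, v_upper, v_lower, n, horizon, log_term, big_l):
    """Bernstein width on the upper bound plus the correction for V_upper - V_lower."""
    gap = [u - l for u, l in zip(v_upper, v_lower)]
    correction = (28.0 * horizon * big_l / (3.0 * math.sqrt(n))
                  + big_l * weighted_l2_norm(p_hat, gap)) / math.sqrt(n)
    return bernstein_term(p_hat, v_upper, n, horizon, log_term) + correction


# Hypothetical numbers: two successor states, horizon 4.
stochastic = dict(p_hat=[0.5, 0.5], v_upper=[1.0, 3.0], v_lower=[0.5, 2.0])
settled = dict(p_hat=[1.0, 0.0], v_upper=[1.0, 3.0], v_lower=[1.0, 3.0])
common = dict(horizon=4, log_term=2.0, big_l=2.0)

bonus_few = transition_bonus(n=10, **stochastic, **common)
bonus_many = transition_bonus(n=1000, **stochastic, **common)
settled_few = transition_bonus(n=10, **settled, **common)
settled_many = transition_bonus(n=1000, **settled, **common)
```

When the empirical next-state variance is zero and the upper and lower value bounds coincide (the `settled` case), only the terms of order $1/n$ remain, whereas in the `stochastic` case the bonus also contains terms of order $1/\sqrt{n}$.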

5 Planning Horizon Dependence in Dominant Term

Before discussing the implications of our result above for several common classes of Markov decision processes with particular forms of structure, we first show that these new results can help address a recently posed open question in the learning theory community [JA18]. The question centers on whether there must exist a dependence of sample complexity and regret lower bounds on the planning horizon $H$ for episodic tabular MDP reinforcement learning tasks. Existing lower bound results for sample complexity [DB15] do not have a dependence on the horizon, and the best existing minimax regret bounds do have a dependence on the horizon under asymptotic assumptions [AOM17]; however, such results have been derived under the common assumption of reward uniformity, that per-time-step rewards are bounded between 0 and 1. Jiang and Agarwal [JA18] instead pose a more general setting, in which they assume that per-time-step rewards satisfy $r_h \ge 0$ and $\sum_{h=1}^{H} r_h \in [0,1]$ almost surely: note that the standard setting of reward uniformity can be expressed in this setting by first normalizing all rewards by dividing by $H$. The authors then ask whether, in this new, more general setting of tabular episodic RL, there is necessarily a dependence on the planning horizon in the lower bounds.

Their assumptions immediately imply that

$$0 \le V_t^{\pi^*}(s) \le \mathcal{G} \le 1, \quad \forall (s,t) \in \mathcal{S} \times [H]. \tag{7}$$

Further, since $V_t^{\pi^*}(s) \le 1$ and $r(s,a) \ge 0$ we must have $r(s,a) \in [0,1]$, which is the assumption of this work. Therefore our main result, Theorem 1, applies here. Recalling that $T = KH$, we obtain an upper bound on regret as

$$O\Big(\mathrm{polylog}(S,A,K,H,1/\delta)\Big(\sqrt{\tfrac{1}{H} SAKH} + \sqrt{S}\,SAH^2\big(\sqrt{S}+\sqrt{H}\big)\Big)\Big) \tag{8}$$

$$= O\Big(\mathrm{polylog}(S,A,K,H,1/\delta)\Big(\sqrt{SAK} + \sqrt{S}\,SAH^2\big(\sqrt{S}+\sqrt{H}\big)\Big)\Big). \tag{9}$$

Note that the planning (episodic) horizon $H$ does not appear in the dominant regret term, which scales polynomially with the number of episodes $K$; it appears only in terms that scale logarithmically with $K$, or as part of transient lower order terms that are independent of $K$. In other words, up to logarithmic dependency and transient terms, we have an upper bound on episodic regret that is independent of the horizon $H$. This result then answers part of Jiang and Agarwal's open question: for their setting, we have shown that regret scales independently of the horizon $H$ in the dominant term. Notice that our bound has no dependence on $H$ in the term that scales as a polynomial function of $K$, and our algorithm does not need to be provided with information about the maximum possible value function.

Jiang and Agarwal [JA18] also ask about the dependence of a lower bound on the sample complexity on the planning horizon. While our work focuses on a regret analysis, and does not provide PAC sample complexity results, we can use our regret results to bound with high probability the number of episodes needed to ensure that the average per-episode regret is less than $\epsilon$. To do so, we obtain the average per-episode loss of Euler by dividing by the number of episodes $K$:

$$O\Big(\mathrm{polylog}(S,A,K,H,1/\delta)\Big(\sqrt{\frac{SA}{K}} + \frac{\sqrt{S}\,SAH^2\big(\sqrt{S}+\sqrt{H}\big)}{K}\Big)\Big). \tag{10}$$

From here we can seek the smallest $K$ such that the average error is smaller than $\epsilon$, obtaining:

$$O\Big(\mathrm{polylog}(S,A,K,H,1/\delta)\Big(\frac{SA}{\epsilon^2} + \frac{\sqrt{S}\,SAH^2\big(\sqrt{S}+\sqrt{H}\big)}{\epsilon}\Big)\Big) \tag{11}$$

episodes before the average per-episode error is smaller than $\epsilon$. For small $\epsilon \ll SA$, the first term dominates, which again has no polynomial dependence on $H$. Of course, in order to formally obtain a PAC result (a worst-case upper bound on the number of $\epsilon$-suboptimal episodes) the algorithm would need to be modified to be PAC. In particular, the exploration bonus for a given $(s,a)$ pair should be designed so that it does not increase with time $T$ if $(s,a)$ is not visited. In practice this means replacing the $\log T$ factor of the exploration bonuses with something like $\log n$, where $n$ is the visit count of a specific state-action pair, and adjusting the numerical constant to make sure the exploration bonuses (confidence intervals) are still valid with high probability. Please see [DLB17] for a detailed explanation of how to proceed with the algorithm design and analysis in this case³.

³ Indeed, the algorithm described in [DLB17] is structurally similar to ours.

It remains an open question whether we could further avoid a dependence of our regret bounds on the planning horizon, either in the transient terms or in the log dependencies. However, these results are promising, since they suggest that learning in sparse-reward, long-horizon episodic MDP environments may not be fundamentally much harder than in shorter-horizon domains, as long as the total reward is bounded. We next demonstrate that several classes of episodic MDPs have additional structure that also guarantee

better performance (lower regret bounds), even though our algorithm need not be aware of such structure.

6 Problem dependent bounds

We now focus on deriving regret bounds for selected MDP classes that are very common in RL. We emphasize that such setting-dependent guarantees are obtained with the same algorithm, which is agnostic to the particular MDP at hand and its $\mathbb{Q}^*$ value. Although the described settings share common features and are sometimes subclasses of one another, they are presented in separate subsections due to their important relation to the past literature as well as their practical relevance. Importantly, they are all characterized by a low environmental norm.

6.1 Bounds using the range of optimal value function

To improve over the worst-case bound in infinite horizon RL there have been approaches that aim at obtaining stronger problem-dependent bounds if the value function does not vary much across different states of the MDP. If $\mathrm{rng}\,V^{\pi^*}$ is smaller than the worst case (either $H$ or $D$ for fixed horizon vs. recurrent RL), the reduced variability in the expected return suggests that we can lower the exploration effort by constructing tighter confidence intervals. This is achieved in [BT09] by providing additional information to their algorithm Regal, achieving a regret bound of order:

$$\tilde{O}\big(\Phi S\sqrt{AT}\big) \tag{12}$$

where $\Phi \ge \mathrm{rng}\,V^{\pi^*}$ is an overestimate of the optimal value function range and is an input to the algorithm described in that paper. This means that if domain knowledge is available and is supplied to the algorithm, the regret can be substantially reduced. This line of research was followed in [FPLO18], which derived a computationally tractable variant of Regal. However, they still require knowledge of a value function range upper bound $\Phi \ge \mathrm{rng}\,V^{\pi^*}$. Specifying too high a value for $\Phi$ would increase the regret, and too low a value would cause the algorithm to fail.
Although restricted to the episodic setting, our analysis implies that it is possible to achieve at least the same, but potentially a much better, level of performance without knowing the optimal value function range to construct the exploration bonus. This follows as an easy corollary of our main regret upper bound (Theorem 1) after bounding the environmental norm, as we discuss below. Let $S_{s,a}$ be the set of immediate successor states after one transition from state $s$ upon taking action $a$ there, that is, the states in the support of $p(s,a)$, and define

$$\Phi_{succ} \overset{def}{=} \max_{s,a,t}\ \mathrm{rng}_{s^+ \in S_{s,a}} V_{t+1}^{\pi^*}(s^+) \tag{13}$$

as the maximum value function range when restricted to the immediate successor states. Since the variance of a random variable is upper bounded by one fourth of its squared range, we have that:

$$\mathbb{Q}^* \overset{def}{=} \max_{s,a,t}\Big(\mathrm{Var}\,R(s,a) + \mathrm{Var}_{s^+ \sim p(s,a)} V_{t+1}^{\pi^*}(s^+)\Big) \lesssim \max_{s,a,t}\Big([\mathrm{rng}\,R(s,a)]^2 + \big[\mathrm{rng}_{s^+ \in S_{s,a}} V_{t+1}^{\pi^*}(s^+)\big]^2\Big) \le 1 + \Phi_{succ}^2.$$

This immediately yields a result comparable to that of [BT09, FPLO18] when restricted to the fixed horizon framework:

Corollary 1.2 (Bounded Range of $V^{\pi^*}$ Among Successor States). With probability at least $1-\delta$, the regret of Euler is bounded by:

$$\tilde{O}\Big(\Phi_{succ}\sqrt{SAT} + \sqrt{S}\,SAH^2\big(\sqrt{S}+\sqrt{H}\big)\Big). \tag{14}$$

A few remarks are in order:

• The problem-independent bound $D$ or $H$ is replaced by a problem-dependent quantity, as in [BT09, FPLO18].
• Euler does not need to know the value of $\Phi_{succ}$, of the environmental norm, or of the value function range to attain the improved bound.
• $\Phi_{succ}$ can be much smaller than $\mathrm{rng}\,V^{\pi^*}$ because it is the range of $V^{\pi^*}$ restricted to a few successor states as opposed to across the whole domain, and therefore it is never larger than the overestimate $\Phi$ used in [BT09, FPLO18]; in other words:

$$\Phi \ge \mathrm{rng}\,V^{\pi^*} \ge \Phi_{succ}. \tag{15}$$

• [BT09, FPLO18] consider the more challenging infinite horizon setting. Our results hold for fixed horizon RL. It is an interesting research question whether our result can be extended to the recurrent RL setting considered by [BT09, FPLO18].
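The two facts behind this corollary, that a variance never exceeds one fourth of the squared range and that the range over successor states can be far smaller than the range over all states, can be checked numerically. The sketch below does this for a single timestep with made-up numbers; the value vector, the transition table, and the function names are all hypothetical.

```python
def variance(probs, values):
    """Variance of `values` under the probability mass function `probs`."""
    mean = sum(q * v for q, v in zip(probs, values))
    return sum(q * (v - mean) ** 2 for q, v in zip(probs, values))


def support_range(probs, values):
    """Range of `values` restricted to states with non-zero probability."""
    support = [v for q, v in zip(probs, values) if q > 0.0]
    return max(support) - min(support)


# Hypothetical next-step optimal values over three states.
next_values = [0.0, 1.0, 10.0]

# Hypothetical transition rows p(. | s, a); state 2 is never a successor here.
transitions = {
    ("s0", "a0"): [0.5, 0.5, 0.0],
    ("s0", "a1"): [0.0, 1.0, 0.0],
    ("s1", "a0"): [0.2, 0.8, 0.0],
}

# Range over successor states only, versus range over the whole state space.
phi_succ = max(support_range(row, next_values) for row in transitions.values())
global_range = max(next_values) - min(next_values)

# Variance is at most range^2 / 4 for every (s, a) pair.
variance_within_bound = all(
    variance(row, next_values) <= support_range(row, next_values) ** 2 / 4 + 1e-12
    for row in transitions.values()
)
```

Here the state with value 10 is never reached in one step, so the successor range is 1 while the global range is 10, which is the kind of gap the corollary exploits.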
6.2 Bounds on the next-state variance of $V^{\pi^*}$ and empirical benchmarks

As we discuss throughout this section, the environmental norm together with our new result in Theorem 1 can help explain the hardness of subclasses of MDP domains. Before considering another important

subclass in the next subsection, we first wish to highlight that the environmental norm can also empirically characterize the hardness of RL in single problem instances. This was one of the key contributions of the work that introduced the environmental norm [MMM14], which evaluated the environmental norm for a number of common RL benchmark simulation tasks including mountain car, pinball, taxi, bottleneck, inventory and red herring (see the table in paragraph 3.2 in [MMM14]). In these domains the environmental norm is nicely correlated with the complexity of reinforcement learning, as evaluated empirically. Indeed, in these settings, the environmental norm is often much smaller than the maximum value function range, which can itself be much smaller than the worst-case bound $D$ or $H$. Our new results provide solid theoretical justification for the observed empirical savings.

This measure of MDP complexity also intriguingly allows us to gain more insight into another important simulation domain, chain MDPs like that in Figure 1. Chain MDPs have been considered a canonical example of challenging hard-to-learn RL domains, since naive strategies like $\epsilon$-greedy can take exponential time to attain satisfactory performance. Setting for simplicity $N \overset{def}{=} S = H$, efficient exploration strategies (for example [AOM17, KWY18]) should incur $\tilde{O}(\sqrt{HSAT}) = \tilde{O}(\sqrt{N^2 T})$ regret in the leading order term. In contrast, Euler incurs only $\tilde{O}(\sqrt{T})$ regret in the leading order term. This is intriguing because it reinforces the idea that truly pathological MDPs may be even less common than expected.
More details about this example are given in appendix A.1.

[Figure 1: Classical hard-to-learn MDP. A chain over states $s_1, s_2, s_3, \dots, s_{N-2}, s_{N-1}, s_N$.]

6.3 Stochasticity in the system dynamics

In this section we consider two opposite classes of problems, namely deterministic MDPs and MDPs that are highly stochastic in that the successor state is sampled from a fixed distribution. These bounds are also a direct consequence of Theorem 1 and can alternatively be deduced from Corollary 1.2, but are discussed separately due to their importance. Of course, there exists a wide spectrum of MDPs ranging from deterministic to fully stochastic ones; here for simplicity we consider only the limit cases.

Deterministic domains

Many problems of practical interest, for example in robotics, have low stochasticity, and this immediately yields a low value for $\mathbb{Q}^*$. As a limit case, we consider domains with deterministic rewards and dynamics models. An agent designed to learn deterministic domains only needs to experience every transition once to reconstruct the model, which can take up to $O(SA)$ episodes with a regret of at most $O(SAH)$ [WV13]. For Euler an improved bound is straightforwardly derived because $\mathbb{Q}^* = 0$. Therefore if Euler is run on any deterministic MDP then the regret expression exhibits only a $\log T$ dependence. This is a substantial improvement since prior RL regret bounds for problem-independent settings all have at least a $\sqrt{T}$ dependence. Further, a refined analysis (sec. G.3 in the appendix) shows that Euler is close to the lower bound except for a factor in the horizon and logarithmic terms:

Proposition 1. If Euler is run on a deterministic MDP then the regret is bounded by $\tilde{O}(SAH^2)$.

The result of the above proposition is important because it supports the idea that Euler can inherit near-optimal performance even in domains very different from the pathologically hard MDPs for which it is designed to do well.
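For reference, the $\log T$ claim for deterministic domains can be read off by substituting $\mathbb{Q}^* = 0$ into the first bound of Theorem 1. The following worked step assumes the bound has the form $\sqrt{\mathbb{Q}^* SAT}$ plus a $T$-independent lower-order term, as stated in the abstract and Section 4:

```latex
\begin{align*}
\mathrm{Regret}(K)
 &\le O\Big(\mathrm{polylog}(S,A,T,1/\delta)\,
      \Big(\sqrt{\mathbb{Q}^* SAT} + \sqrt{S}\,SAH^2\big(\sqrt{S}+\sqrt{H}\big)\Big)\Big) \\
 &= O\Big(\mathrm{polylog}(S,A,T,1/\delta)\,
      \sqrt{S}\,SAH^2\big(\sqrt{S}+\sqrt{H}\big)\Big)
 \qquad \text{when } \mathbb{Q}^* = 0 .
\end{align*}
```

The only remaining dependence on $T$ is through the polylogarithmic factor, while Proposition 1 tightens the remaining polynomial factor through a separate, refined analysis.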
Highly mixing domains

Recently, [ZB18] showed that it is possible to design an algorithm that can switch between the MDP and the contextual bandit framework while retaining near-optimal performance in both without being informed of the setting. They consider mapping a contextual bandit to an MDP whose transitions to different states (or contexts) are sampled from a fixed underlying distribution over which the agent has no control. The Bandit-MDP considered in [ZB18] is an environment with high stochasticity (the MDP is highly mixing, since every state can be reached with some probability in one step). Since the transition function is unaffected by the agent, an easy computation yields $\mathrm{rng}\,V_t^{\pi^*} \le 1$, as replicated in appendix A.2, and a regret guarantee for Euler of order $\tilde{O}(\sqrt{SAT})$ in the leading order term, which matches the established lower bound for tabular contextual bandits [BCB12], follows from Corollary 1.2. This is useful since in many practical applications it is unclear in advance whether the domain is best modeled as a bandit or a sequential RL problem. Notice that our result improves over [ZB18], since Euler has better worst-case guarantees by a factor of $H$ and, most importantly, can deal with next-state distributions that have zero or near-zero mass over some of the next states: the inverse minimum visitation probability does not show up in our analysis.

7 Sketch of the Theoretical Analysis

We devote this section to a sketch of the main point of the regret analysis that yields problem-dependent bounds. Central to the analysis is the relation between the agent's optimistic MDP and the true MDP. A more detailed overview of the proof is given in section B of the appendix, while the rest of the appendix presents the detailed analysis under a more general framework.

Regret Decomposition

Denote with $\mathbb{E}_{(s,a)\sim\pi_k}$ the expectation taken along the trajectories identified by the agent's policy. A standard regret decomposition is given below (see [DLB17, AOM17]):

$$\mathrm{Regret}(K) \le \sum_{k\in[K]} \sum_{t\in[H]} \sum_{(s,a)\in\mathcal{S}\times\mathcal{A}} \mathbb{E}_{(s,a)\sim\pi_k} \Big( \underbrace{\tilde{r}_k(s,a) - r(s,a)}_{\text{Reward Estimation and Optimism}} + \underbrace{(\tilde{p}_k(s,a) - \hat{p}_k(s,a))^\top \overline{V}_{t+1}}_{\text{Transition Dynamics Optimism}} + \underbrace{(\hat{p}_k(s,a) - p(s,a))^\top V_{t+1}^{\pi^*}}_{\text{Transition Dynamics Estimation}} + \underbrace{(\hat{p}_k(s,a) - p(s,a))^\top (\overline{V}_{t+1} - V_{t+1}^{\pi^*})}_{\text{Lower Order Term}} \Big) \tag{16}$$

Here, the tilde quantities $\tilde{r}$ and $\tilde{p}$ represent the agent's optimistic estimates. Of the terms in equation 16, the Transition Dynamics Estimation and the Transition Dynamics Optimism are the leading terms to bound as far as the regret is concerned. The former is expressed through MDP quantities (i.e., the true transition dynamics $p(s,a)$ and the optimal value function $V_{t+1}^{\pi^*}$) and hence it can be readily bounded using Bernstein's inequality, giving rise to a problem-dependent regret contribution. More challenging is to show that a similar simplification can be obtained for the Transition Dynamics Optimism term, which relies on the agent's optimistic estimates $\tilde{p}_k(s,a)$ and $\overline{V}_{t+1}$.

Optimism on the System Dynamics

Said term $(\tilde{p}_k(s,a) - \hat{p}_k(s,a))^\top \overline{V}_{t+1}$ represents the difference between the agent's imagined (i.e., optimistic) transition $\tilde{p}_k(s,a)$ and the maximum likelihood transition $\hat{p}_k(s,a)$, weighted by the next-state optimistic value function.
By construction, this is the exploration bonus, which (ignoring log-factors and constants) reads:

$$\text{Transition Dynamics Optimism} = \text{Exploration Bonus} \approx \underbrace{\underbrace{\sqrt{\frac{\mathrm{Var}_{s' \sim \hat{p}_k(s,a)} \overline{V}_{t+1}(s')}{n_k(s,a)}}}_{\text{Dominant Term of Exploration Bonus}} + \frac{H}{n_k(s,a)}}_{\text{Empirical Bernstein}} \tag{17}$$

$$+ \underbrace{\frac{\|\overline{V}_{t+1} - \underline{V}_{t+1}\|_{2,\hat{p}_k(s,a)}}{\sqrt{n_k(s,a)}} + \frac{H}{n_k(s,a)}}_{\text{Correction Bonus}} \tag{18}$$

In the above expression the Correction Bonus is needed to ensure optimism because the Empirical Bernstein contribution is evaluated with the agent's estimate $\overline{V}_{t+1}$ as opposed to the real $V_{t+1}^{\pi^*}$. If we assume that $\|\overline{V}_{t+1} - \underline{V}_{t+1}\|_{2,\hat{p}_k(s,a)}$ shrinks quickly enough, then the Dominant Term in equation 17 is the most slowly decaying term, with a rate $1/\sqrt{n}$. If that term involved the true transition dynamics $p(s,a)$ and value function $V_{t+1}^{\pi^*}$ as opposed to the agent's estimates $\hat{p}_k(s,a)$ and $\overline{V}_{t+1}$, then problem-dependent bounds would follow in the same way as they can be proved for the Transition Dynamics Estimation. Therefore we wish to study the relation between such Dominant Term evaluated with the agent's MDP estimates vs. the MDP's true parameters.

Dominant Term of the Exploration Bonus

The study just discussed gives the upper bound (a):

$$\text{Dominant Term of Exploration Bonus} \overset{(a)}{\lesssim} \underbrace{\sqrt{\frac{\mathrm{Var}_{p(s,a)} V_{t+1}^{\pi^*}}{n_k(s,a)}}}_{\text{Gives Problem Dependent Bounds}} + \underbrace{\frac{H}{n_k(s,a)} + \frac{\|\overline{V}_{t+1} - \underline{V}_{t+1}\|_{2,\hat{p}_k(s,a)}}{\sqrt{n_k(s,a)}}}_{\text{Shrinks Faster}} \tag{19}$$

In words, we have decomposed the Dominant Term of the Exploration Bonus (which is constructed using the agent's available knowledge) as a problem-dependent contribution that is equivalent to Bernstein's inequality evaluated as if the model was known, plus a term that accounts for the distance between the two and shrinks faster than the former. This is precisely the Correction Bonus that we use in equation 17 and in the definition of the algorithm itself.

What gives rise to problem dependent bounds?

Our analysis highlights that it is as if Euler was using Bernstein's inequality evaluated with precise domain

9 knowledge of p s,a and Vt+1 π to construct problem dependent confidence intervals with a correction term that guarantees optimism and is rapidly decaying. This correction term ˆp k s,a is a function of the inaccuracy of the value function estimate at the next-step states re-weighted by their relative importance as encoded in the experienced transitions ˆp k s,a. Said correction term is of high value only if the successor states do not have an accurate estimate for the value function and they are going to be visited with high probability. A pigeonhole argument guarantees that this situation cannot happen for too longensuring fast decayof ˆp k s,a and therefore of the whole Correction Bonus of eq. 17. Notice that such considerations and results would not be achievable by a naive application of an Hoeffdinglike inequality as the latter would put equal weight on all successor states, but the accuracy in the estimation of Vt+1 π only shrinks in a way that depends on the visitation frequency of said successor states as encoded in ˆp k s,a. The keyto enableproblem dependent bound is, therefore, to re-weight the importance of the uncertainty on the value function of the successor states by the corresponding visitation probability, which Bernstein Inequality implicitly does. 8 Related Literature Our connection with the most closely connected prior work [BT09, FPLO18, MMM14, ZB18] has been described in the main text. Here we focus on the remaining most relevant literature. Bounds that depend on the MDP mixing property: [AO06, TM18] have bounds on the mixing time but hold for MDPs that are unichain or ergodic. Most MDPs do not possess this property. Bounds function of the gap between policies: [EMM06] has bounds dependent on the minimum gap in the optimal state action values between the best and second best action, and Ucrl2 has bounds as function of the gap in the difference in the average reward between the best and second best policies. 
Since these gaps can be arbitrarily small, such bounds can become arbitrarily large; further, they do not reflect the additional problem-dependent structure, connected to the value function, that we and others have considered here.

Regret bounds with value function approximation: [OR14] uses the Eluder dimension as a measure of the dimensionality of the problem and [JKA+17] proposes the Bellman rank to measure the learning complexity, but such measures aim at capturing a different notion of hardness than ours and do not match the lower bound in the tabular setting.

9 Future Work and Conclusion

In this paper we have proposed Euler, an algorithm for episodic MDPs with finite state and action spaces that matches the best known worst-case regret guarantees while provably obtaining much tighter guarantees if the domain has a small variance of the value function over the next-state distribution, previously defined as the environmental norm [MMM14]. A key feature is that Euler does not need to know any bound on the environmental norm of the domain in advance. In contrast to prior problem-dependent bounds that relied on overestimates $\Phi$ of the range of the value function $\operatorname{rng} V^{\pi^\star}$ across all states [BT09, FPLO18], we consider a tighter quantity: the range of the optimal value function restricted to successor states, $\Phi_{succ}$ in corollary 1.2, as an upper bound to the environmental norm $\mathbb{Q}^\star$. More precisely:

\[
\sqrt{\mathbb{Q}^\star}\le\Phi_{succ}\le\operatorname{rng}V^{\pi^\star}\le\Phi
\tag{20}
\]

implying that our problem-dependent regret bounds based on the environmental norm are tighter than previous proposals based on the range/span of the value function. We show that the environmental norm is low for a number of important subclasses of MDPs, including MDPs with sparse rewards, near-deterministic MDPs, highly mixing MDPs (such as those closer to bandits), and some classical empirical benchmarks.
We also show how our result helps answer a recent open learning theory question about the necessary dependence of sample complexity and regret results on the planning/episode horizon. Interesting directions for future work are to see whether our results can also be made a function of the gaps between the Q values, which is typical of bandit analyses, and to extend these ideas to undiscounted infinite-horizon reinforcement learning and to the far more challenging function approximation setting, for which even problem-independent guarantees have proved to be mostly elusive.

References

[AMK12] Mohammad Gheshlaghi Azar, Remi Munos, and Hilbert J. Kappen. On the sample complexity of reinforcement learning with a generative model. In ICML, 2012.

[AO06] Peter Auer and Ronald Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In NIPS, 2006.
[AOM17] Mohammad Gheshlaghi Azar, Ian Osband, and Remi Munos. Minimax regret bounds for reinforcement learning. In ICML, 2017.
[BCB12] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 2012.
[BT09] Peter L. Bartlett and Ambuj Tewari. Regal: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.
[DB15] Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In NIPS, 2015.
[DLB17] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In NIPS, 2017.
[EMM06] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 2006.
[FPLO18] Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning, 2018.
[JA18] Nan Jiang and Alekh Agarwal. Open problem: The dependence of sample complexity lower bounds on planning horizon. In Conference On Learning Theory, 2018.
[JKA+17] Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E. Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In ICML, 2017.
[JOA10] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 2010.
[KWY18] Sham Kakade, Mengdi Wang, and Lin F. Yang. Variance reduction methods for sublinear reinforcement learning. arXiv, 2018.
[LH14] Tor Lattimore and Marcus Hutter. Near-optimal PAC bounds for discounted MDPs. In Theoretical Computer Science, 2014.
[MMM14] Odalric-Ambrym Maillard, Timothy A. Mann, and Shie Mannor. "How hard is my MDP?" The distribution-norm to the rescue. In NIPS, 2014.
[MP09] Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample variance penalization. In COLT, 2009.
[OR14] Ian Osband and Benjamin Van Roy. Model-based reinforcement learning and the eluder dimension. In NIPS, 2014.
[OR17] Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In ICML, 2017.
[OV16] Ian Osband and Benjamin Van Roy. On lower bounds for regret in reinforcement learning. In arXiv, 2016.
[OVR13] Ian Osband, Benjamin Van Roy, and Daniel Russo. (More) efficient reinforcement learning via posterior sampling. In NIPS, 2013.
[SB98] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[TM18] Mohammad Sadegh Talebi and Odalric-Ambrym Maillard. Variance-aware regret bounds for undiscounted reinforcement learning in MDPs. In ALT, 2018.
[WOS+03] Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J. Weinberger. Inequalities for the l1 deviation of the empirical distribution. Technical report, Hewlett-Packard Labs, 2003.
[WV13] Zheng Wen and Benjamin Van Roy. Efficient exploration and value function generalization in deterministic systems. In NIPS, 2013.
[ZB18] Andrea Zanette and Emma Brunskill. Problem dependent reinforcement learning bounds which can identify bandit structure in MDPs. In ICML, 2018.

Contents

1 Introduction
2 Preliminaries
3 Euler
4 Main Result
    Problem Dependent Regret Bound
    Planning Horizon Dependence in Dominant Term
6 Problem dependent bounds
    Bounds using the range of optimal value function
    Bounds on the next-state variance of $V^{\pi^\star}$ and empirical benchmarks
    Stochasticity in the system dynamics
7 Sketch of the Theoretical Analysis
8 Related Literature
9 Future Work and Conclusion
A Short Proofs Missing from the Main Text
    A.1 Euler on Chain
    A.2 Euler on Tabular Contextual Bandits
B Appendix Overview and Proof Preview
    B.1 Algorithm
    B.2 Optimism
    B.3 Confidence Interval for the Value Function
    B.4 Regret Bound
C Failure Events and their Probabilities
    C.1 Empirical Bernstein Inequality for the Rewards
    C.2 Admissible Confidence Intervals on the Transition Dynamics
    C.3 Bernstein's Inequality
    C.4 Other Failure Events and Their Probabilities
D Optimism
    D.1 Rewards
    D.2 Transition Dynamics
    D.3 Algorithm is Optimistic
E Delta Optimism
F The Good Set $L_{tk}$
G Regret Analysis
    G.1 Main Result
    G.2 Regret Bounds with Bernstein Inequality
    G.3 Regret Bound in Deterministic Domain with Bernstein Inequality
    G.4 Reward Estimation and Optimism
    G.5 Transition Dynamics Estimation
    G.6 Transition Dynamics Optimism
    G.7 Lower Order Term
    G.8 Auxiliary Lemmas

A Short Proofs Missing from the Main Text

A.1 Euler on Chain

Chain MDPs are commonly given as examples of challenging exploration domains because simple strategies like $\epsilon$-greedy can take an exponential time to learn. We now discuss an intriguing result for the chain shown in figure 1, which is nearly identical to a previously introduced one [OR17]. As in that domain, each episode is $N = H = S$ timesteps long. The optimal policy is to go right, which yields a reward of 1 for the episode. The transition probabilities are assigned in a way that scales the optimal value function so that it is of order 1. Since the rewards are deterministic, the reward variance vanishes and we immediately obtain, up to a constant that does not depend on $N$:

\[
\mathbb{Q}^\star=\max_{s,a,t}\operatorname{Var}\big(V^{\pi^\star}_{t+1}(s^+)\mid s,a\big)\lesssim\Big(1-\frac{1}{N}\Big)\frac{1}{N}\le\frac{1}{N}
\]

since $V^{\pi^\star}_{t+1}(s^+)$ is essentially dominated by a Bernoulli random variable with success parameter $1-1/N$ times an appropriate scaling factor of order one. Therefore Euler's regret (see footnote 4) is dominated by a term which is

\[
\tilde O\big(\sqrt{\mathbb{Q}^\star SAT}\big)=\tilde O\Big(\sqrt{\tfrac{1}{N}\cdot N\cdot 2\cdot T}\Big)=\tilde O\big(\sqrt{T}\big).
\]

Notice that the lower order term should be added to the above expression, and this is likely to be significant, particularly for small $T$. However, we remark that the result above follows directly from Theorem 1, whose proof does not attempt to make the lower order term problem-dependent. Our result is substantially smaller than the typically reported bounds for this case, which are dominated by a $\tilde O(\sqrt{HSAT})$ term.
There are two main factors that lead to the above simplification for this class of MDPs: $V^{\pi^\star}$ is of order 1 and not $N = H$ as in the hard-to-learn MDPs that yield the lower bounds [DB15], and the variance decreases as $N$ increases. Each of these removes a factor of $\sqrt{N}$ from the known worst-case bound $\tilde O(\sqrt{HSAT}) = \tilde O(\sqrt{N^2AT})$ after substituting $N = S = H$. To enable Euler's level of performance on this and other MDPs one needs to carefully control the confidence intervals of both the rewards and the transition probabilities, as Euler does, since Hoeffding-like concentration inequalities for the rewards alone would already induce a $\Theta(\sqrt{SAT}) = \Theta(\sqrt{NT})$ contribution to the regret expression. This result is intriguing because it suggests that truly pathological MDP classes which induce $\Omega(\sqrt{HSAT})$ regret are even more uncommon than previously thought.

Footnote 4: This is valid for any class of MDP which shares these properties, implying that the regret for the MDP in figure 1 can be even smaller.
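The orders of magnitude in the chain argument can be checked numerically. The sketch below is only illustrative and is not taken from the paper: it assumes $S = H = N$ and $A = 2$, uses the Bernoulli variance $(1 - 1/N)/N$ as a stand-in for the environmental norm (ignoring the order-one scaling factor), and drops all constants and log factors. The function names are hypothetical.

```python
import math


def bernoulli_variance(n_states: int) -> float:
    """Variance of a Bernoulli variable with success parameter 1 - 1/N.

    Used here as an assumed proxy for the environmental norm of the chain.
    """
    p = 1.0 - 1.0 / n_states
    return p * (1.0 - p)


def problem_dependent_term(n_states: int, n_actions: int, total_steps: int) -> float:
    """Leading term sqrt(Q * S * A * T) with S = N (constants and logs dropped)."""
    q = bernoulli_variance(n_states)
    return math.sqrt(q * n_states * n_actions * total_steps)


def worst_case_term(n_states: int, n_actions: int, total_steps: int) -> float:
    """Leading term sqrt(H * S * A * T) with H = S = N (constants and logs dropped)."""
    horizon = n_states
    return math.sqrt(horizon * n_states * n_actions * total_steps)


if __name__ == "__main__":
    T = 10**6
    for N in (5, 10, 50):
        ratio = worst_case_term(N, 2, T) / problem_dependent_term(N, 2, T)
        print(f"N={N}: worst-case / problem-dependent leading term = {ratio:.2f}")
```

Under these assumptions the problem-dependent leading term never exceeds $\sqrt{AT}$, while the worst-case term equals $N\sqrt{AT}$, so the ratio between the two is at least $N$.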

A.2 Euler on Tabular Contextual Bandits

Contextual multi-armed bandits are a generalization of the multi-armed bandit problem; see for example [BCB12] for a survey. In their simplest possible formulation they entail a discrete set of contexts or states $\{s = 1, 2, \dots\}$ and actions $\{a = 1, 2, \dots\}$, and the expected reward $r(s,a)$ depends on both the state and the action. After playing an action, the agent transitions to the next state according to some fixed distribution $\mu\in\mathbb{R}^S$ over which the agent has no control. In principle, such a problem can be recast as an MDP in which the next state is independent of the prior state and action.

Consider an $H$-horizon MDP which maps to a tabular contextual bandit problem: the transition probability is identical, $p(s'\mid s,a)=\mu(s')$, for all states and actions, where $\mu$ is a fixed distribution from which the next states are sampled. In such an MDP, define the best and worst context at time $t$, respectively:

\[
\overline s_t\overset{def}{=}\arg\max_s V^{\pi^\star}_t(s)\quad\text{and}\quad\underline s_t\overset{def}{=}\arg\min_s V^{\pi^\star}_t(s)
\]

and recall that the transition dynamics $p(s,a)=\mu$ depends neither on the action $a$ nor on the current state $s$. We have that:

\[
\begin{cases}
V^{\pi^\star}_t(\overline s_t)=\max_a r(\overline s_t,a)+\mu^\top V^{\pi^\star}_{t+1}\\
V^{\pi^\star}_t(\underline s_t)=\max_a r(\underline s_t,a)+\mu^\top V^{\pi^\star}_{t+1}
\end{cases}
\tag{22}
\]

Subtracting the two lines cancels the common term $\mu^\top V^{\pi^\star}_{t+1}$, which immediately yields a bound on the range of the value function at every timestep $t$, and hence on the range over the successor states:

\[
\operatorname{rng}V^{\pi^\star}_t=V^{\pi^\star}_t(\overline s_t)-V^{\pi^\star}_t(\underline s_t)=\max_a r(\overline s_t,a)-\max_a r(\underline s_t,a)\le 1
\tag{23}
\]

where the last inequality follows from the fact that rewards are bounded, $r(\cdot,\cdot)\in[0,1]$. Therefore $\Phi_{succ}\le 1$ and corollary 1.2 yields a high probability regret upper bound of order

\[
\tilde O\Big(\sqrt{SAT}+\sqrt{S}\,SAH^2\big(\sqrt{S}+\sqrt{H}\big)\Big)
\tag{24}
\]

for Euler. This means that Euler automatically attains the lower bound in the dominant term for tabular contextual bandits [BCB12] if deployed in such a setting. We did not try to improve the lower order terms for this specific setting; doing so may give an improved bound.

B Appendix Overview and Proof Preview

We start by giving an overview of the proof that leads to the main result. The setting considered in the appendix is more general than the one presented in the main text.
In particular we (1) define a class of concentration inequalities for the transition dynamics and (2) show that Euler achieves strong problem-dependent regret bounds with any concentration inequality satisfying these assumptions. The main result presented in the main text then follows as a corollary of the potentially more general analysis presented in the appendix. We now give a preview of the proof of the main result, assuming rewards are known. We start by recalling Euler with yet-to-be-specified confidence intervals on the transition dynamics.

B.1 Algorithm

The algorithm is presented in figure 2.

Algorithm 2 Euler for Stationary Episodic MDPs
1: Input: confidence intervals $b^r_k$, $\phi$, with failure probability $\delta$, constants $B_p$, $B_v$.
2: $n(s,a) = r_{sum}(s,a) = p_{sum}(s',s,a) = 0$, $\forall (s,a)\in S\times A$; $\overline V_{H+1}(s) = 0$, $\forall s\in S$
3: for $k = 1, 2, \dots$ do
4:   for $t = H, H-1, \dots, 1$ do
5:     for $s\in S$ do
6:       for $a\in A$ do
7:         $\hat p = p_{sum}(\cdot,s,a)/n(s,a)$
8:         $b^{pv}_k = \phi(\hat p(s,a),\overline V_{t+1}) + \frac{4J+B_p}{n(s,a)} + \frac{B_v\|\overline V_{t+1}-\underline V_{t+1}\|_{2,\hat p}}{\sqrt{n(s,a)}}$
9:         $Q(a) = \min\{H-t,\ \hat r_k(s,a) + b^r_k(s,a) + \hat p^\top\overline V_{t+1} + b^{pv}_k\}$
10:      end for
11:      $\pi_k(s,t) = \arg\max_a Q(a)$
12:      $\overline V_t(s) = Q(\pi_k(s,t))$
13:      $b^{pv}_k = \phi(\hat p(s,\pi_k(s,t)),\underline V_{t+1}) + \frac{4J+B_p}{n(s,\pi_k(s,t))} + \frac{B_v\|\overline V_{t+1}-\underline V_{t+1}\|_{2,\hat p}}{\sqrt{n(s,\pi_k(s,t))}}$
14:      $\underline V_t(s) = \max\{0,\ \hat r_k(s,\pi_k(s,t)) - b^r_k(s,\pi_k(s,t)) + \hat p^\top\underline V_{t+1} - b^{pv}_k\}$
15:    end for
16:  end for
17:  $s_1\sim p_0$
18:  for $t = 1, \dots, H$ do
19:    $a_t = \pi_k(s_t,t)$; $r_t\sim p_R(s_t,a_t)$; $s_{t+1}\sim p_P(s_t,a_t)$
20:    $n(s_t,a_t)$++; $p_{sum}(s_{t+1},s_t,a_t)$++
21:  end for
22: end for

B.2 Optimism

The goal of this section is to show that Euler guarantees optimism with high probability. As is well known, optimism is the driver of exploration, as it allows one to overestimate the regret by a concentration term with high probability. Let us consider the planning process at the beginning of episode $k$. This is detailed in lines 4 to 16 of algorithm 1. In order to guarantee finding a pointwise optimistic value function $\overline V_{tk}\ge V^{\pi^\star}_t$, a bonus is added to account for bad luck in the system dynamics experienced up to episode $k$. If the agent knew the value of the confidence interval for the system dynamics $\phi(p(s,a),V^{\pi^\star}_{t+1})$, then optimism could be inductively guaranteed (i.e., assuming that, by induction, $\overline V_{t+1}\ge V^{\pi^\star}_{t+1}$ holds pointwise) if said confidence interval holds. Indeed, the optimistic backup for action $a$ satisfies:

\[
r(s,a)+\hat p_k(s,a)^\top\overline V_{t+1}+\phi\big(p(s,a),V^{\pi^\star}_{t+1}\big)
\tag{25}
\]
\[
\ge r(s,a)+\hat p_k(s,a)^\top V^{\pi^\star}_{t+1}+\phi\big(p(s,a),V^{\pi^\star}_{t+1}\big)\ge r(s,a)+p(s,a)^\top V^{\pi^\star}_{t+1}
\tag{26}
\]

where the first inequality uses the inductive hypothesis and the second uses the confidence interval. If the above conclusion is true for every action then it is true for the maximizer as well. Unfortunately the agent knows neither the real transition dynamics nor the optimal value function needed to evaluate $\phi$. Instead, it only has access to the estimated transition dynamics $\hat p_k(s,a)$ and to an overestimate $\overline V_{t+1}$ of the value function. The confidence interval $\phi$ evaluated with such quantities, $\phi(\hat p_k(s,a),\overline V_{t+1})$, is not guaranteed to overestimate $\phi(p(s,a),V^{\pi^\star}_{t+1})$, and so optimism may not be guaranteed. To remedy this the agent can try to estimate the difference

\[
\phi\big(\hat p_k(s,a),\overline V_{t+1}\big)-\phi\big(p(s,a),V^{\pi^\star}_{t+1}\big)
\tag{27}
\]

and add a correction term to account for that difference. A similar problem is faced in [AOM17], where the authors propose an optimistic bonus which guarantees optimism when using the empirical Bernstein Inequality. By contrast, our way of constructing the bonus works with any concentration inequality satisfying assumptions 1 and 2. Precisely, optimism is dealt with in section D in the appendix; in lemma 4 we show how to bound equation 27, obtaining the result below:

\[
\big|\phi\big(\hat p_k(s,a),V\big)-\phi\big(p(s,a),V^{\pi^\star}_{t+1}\big)\big|\le\frac{B_v\|V-V^{\pi^\star}_{t+1}\|_{2,\hat p}}{\sqrt{n_k(s,a)}}+\frac{B_p+4J}{n_k(s,a)}.
\tag{28}
\]

This is essentially a consequence of the definition of admissible bonus, i.e., def 3. Unfortunately the upper bound in equation 28 depends on $V^{\pi^\star}_{t+1}$, which is not known, so the problem is still unsolved. However, as we show in lemma 5 in the appendix, it suffices to pointwise overestimate $\overline V_{t+1}-V^{\pi^\star}_{t+1}$. To this aim, the algorithm maintains an underestimate of $V^{\pi^\star}_{t+1}$, which we call $\underline V_{t+1}$. Equipped with this underestimate, we define the Exploration Bonus in definition 5 in the appendix, which we report below:

\[
b^{pv}_k\big(\hat p_k(s,a),\overline V_{t+1},\underline V_{t+1}\big)\overset{def}{=}\phi\big(\hat p_k(s,a),\overline V_{t+1}\big)+\frac{4J+B_p}{n_k(s,a)}+\frac{B_v\|\overline V_{t+1}-\underline V_{t+1}\|_{2,\hat p}}{\sqrt{n_k(s,a)}}.
\tag{29}
\]
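The structure of the Exploration Bonus in equation 29 can be made concrete with a small numerical sketch. This is not the authors' implementation. It assumes a Bernstein-style form for $\phi$ (standard deviation term plus an $H/n$ term, with log factors and constants folded into a hypothetical `log_term` parameter) and assumes the weighted norm is $\|x\|_{2,p} = \sqrt{\sum_s p(s)x(s)^2}$. The function names and the guard for unvisited pairs are illustrative choices, while `J`, `B_p` and `B_v` stand for the constants named in the text.

```python
import math


def weighted_l2_norm(x, p):
    """Assumed definition of ||x||_{2,p}: sqrt(sum_s p[s] * x[s]**2)."""
    return math.sqrt(sum(ps * xs * xs for ps, xs in zip(p, x)))


def phi_bernstein(p_hat, v, n, horizon, log_term=1.0):
    """Hypothetical Bernstein-style confidence interval, constants dropped.

    Returns sqrt(Var_{p_hat}(v) * log_term / n) + horizon * log_term / n.
    """
    mean = sum(ps * vs for ps, vs in zip(p_hat, v))
    var = sum(ps * (vs - mean) ** 2 for ps, vs in zip(p_hat, v))
    return math.sqrt(var * log_term / n) + horizon * log_term / n


def exploration_bonus(p_hat, v_upper, v_lower, n, horizon, J, B_p, B_v, log_term=1.0):
    """Sketch of equation 29: phi + (4J + B_p)/n + B_v * ||v_upper - v_lower|| / sqrt(n)."""
    n = max(n, 1)  # guard for unvisited state-action pairs (assumption of this sketch)
    gap = [u - l for u, l in zip(v_upper, v_lower)]
    confidence = phi_bernstein(p_hat, v_upper, n, horizon, log_term)
    lower_order = (4 * J + B_p) / n
    correction = B_v * weighted_l2_norm(gap, p_hat) / math.sqrt(n)
    return confidence + lower_order + correction


if __name__ == "__main__":
    p_hat = [0.5, 0.5]
    for visits in (10, 100, 1000):
        bonus = exploration_bonus(p_hat, [1.0, 0.0], [0.0, 0.0], visits,
                                  horizon=2, J=1, B_p=1, B_v=1)
        print(f"n={visits}: bonus={bonus:.4f}")
```

In this sketch the correction term vanishes when the upper and lower value estimates coincide on the states that `p_hat` actually puts mass on, which mirrors the re-weighting by visitation probability discussed in section 7.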
