A Review of the E^3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time

April 16, 2016

Abstract

In this exposition we study the E^3 algorithm proposed by Kearns and Singh for reinforcement learning, which outputs a policy that is PAC-MDP (Probably Approximately Correct in Markov Decision Processes), and discuss its merits and drawbacks.

1 Introduction

Reinforcement learning is a method by which an agent learns how to act in order to maximize its cumulative reward. A classical example is a cart with a swinging pole that learns how to adjust its velocity so that the pole balances upright for as long as possible. Here the cart controller is the agent, the velocity is the action, and the duration and accuracy of pole-balancing is the reward. At the beginning of the process, the cart controller has little information about how much reward will be given if the cart moves forward at a certain speed. Over some number of interactions with the unknown environment, the agent must learn which actions in which situations maximize the payoff. This process is typically modeled by a Markov decision process, under the assumption that the agent has no knowledge of the parameters of the process. In the cart-pole example above, the cart controller faces the exploration-exploitation trade-off: at any given time, should the cart exploit its current knowledge of the environment to maximize the payoff, staying oblivious to actions with potentially better outcomes, or should it explore by trying different movements in the hope of finding a policy with higher payoff, at the expense of computation time? In this exposition, we discuss how Kearns and Singh address this fundamental trade-off by describing the E^3 algorithm, which is PAC-MDP (Probably Approximately Correct in Markov Decision Processes) as first defined in Strehl et al.
[2006]: with high probability, the algorithm outputs a near-optimal policy in time polynomial in the size of the system and in the approximation parameters. This was the first algorithm to have
explicitly dealt with the explore-exploit dilemma, and it inspired Brafman and Tennenholtz to give a generalized algorithm (the R-max algorithm) that can also be used in adversarial settings [Brafman and Tennenholtz, 2003]. Unfortunately, the E^3 algorithm may not be ideal for reinforcement learning in the real world. First, Kearns and Singh assume that the agent can observe the state of the environment. The second caveat stems from the fact that the E^3 algorithm is model-based: it builds a model of the world through exploration of the environment. This approach is useful when there is a manageable number of states, but may be infeasible if the state space is large. Although the E^3 algorithm only maintains a partial model, even this may be large if, for instance, the state space is continuous. Algorithms such as Q-learning are model-free, maintaining only a value function for each state-action pair, but their convergence is asymptotic, requiring infinite exploration to reach the optimum.

2 Preliminaries

In this section we give definitions of crucial concepts used in Kearns and Singh [2002] as well as in reinforcement learning generally.

Definition 1. A Markov decision process (MDP) M on the set of states {1, 2, ..., N} with actions a_1, ..., a_k is defined by the transition probabilities P_M^a(ij), where P_M^a(ij) is the probability that taking action a at state i leads to state j, for each pair of states i, j and each action a, together with a payoff distribution for each state i with mean R_M(i) satisfying 0 ≤ R_M(i) ≤ R_max and variance Var_M(i) ≤ Var_max. Thus, for every state-action pair (i, a), ∑_{j=1}^{N} P_M^a(ij) = 1.

The upper bounds on the mean and the variance of the rewards are necessary for finite-time convergence. Furthermore, the non-negativity assumption on the rewards can be removed by adding a large constant to every reward.

Definition 2. A policy in an MDP M is a mapping π from the set of states to the set of actions.
Hence, a policy π induces a stationary distribution on the states, since for every state there is only one action given by π. For the purposes of this exposition we deal only with unichain MDPs: MDPs in which every policy has a well-defined stationary distribution, so that the stationary distribution of any policy does not depend on the starting state. Note also that the action chosen by a policy at a given state does not depend on the time of arrival at that state. We now move on to concepts regarding finite-length paths in MDPs.
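To make Definitions 1 and 2 concrete, a tabular MDP of this kind can be represented with a few arrays. The following is a minimal sketch in Python with NumPy; all variable names are our own, not the paper's.

```python
import numpy as np

# A concrete tabular representation of Definition 1: N states, k actions,
# a transition tensor P with P[a, i, j] = P_M^a(ij), and a mean payoff R[i]
# per state bounded by R_max.
N, k, R_max = 3, 2, 1.0
rng = np.random.default_rng(0)
P = rng.random((k, N, N))
P /= P.sum(axis=2, keepdims=True)   # normalize: each (a, i) row is a distribution
R = np.array([0.0, 0.5, 1.0])       # mean payoffs, within [0, R_max]

# Sanity check of Definition 1: sum_j P_M^a(ij) = 1 for every pair (i, a).
assert np.allclose(P.sum(axis=2), 1.0)

# A policy (Definition 2) maps each state to an action; under a fixed policy
# the process becomes a Markov chain whose row i is P_M^{pi(i)}(i, .).
pi = np.array([0, 1, 1])
P_pi = P[pi, np.arange(N), :]
assert np.allclose(P_pi.sum(axis=1), 1.0)
```

The induced chain `P_pi` is what the stationarity discussion above refers to: once π is fixed, the state sequence is an ordinary Markov chain.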
Definition 3. A T-path in M is a sequence p of T + 1 states (i.e., T transitions). The probability that a T-path p = i_1 ... i_{T+1} is traversed in M upon starting in state i_1 and executing policy π is

Pr_M^π[p] = ∏_{k=1}^{T} P_M^{π(i_k)}(i_k, i_{k+1}).    (1)

Definition 4. The undiscounted return along p in M is

U_M(p) = (1/T) ∑_{k=1}^{T} R_{i_k}    (2)

and the discounted return along p in M is

V_M(p) = ∑_{k=1}^{T} γ^{k-1} R_{i_k}    (3)

where γ ∈ (0, 1) is a discount factor that discounts the value of future rewards. The expected T-step undiscounted return from state i is

U_M^π(i, T) = ∑_p Pr_M^π[p] U_M(p)    (4)

where the sum ranges over all T-paths p in M starting at state i, and the discounted version is defined similarly. The optimal T-step undiscounted return is U_M^*(i, T) = max_π U_M^π(i, T), and similarly for V_M^*(i, T). Furthermore, let U_M^*(i) = lim_{T→∞} U_M^*(i, T) and V_M^*(i) = lim_{T→∞} V_M^*(i, T). Note that under the unichain assumption, in the undiscounted case the optimal return does not depend on the starting state i, so we may simply write U_M^*. Also denote by G_max^T the maximum payoff obtainable along any T-path in M.

We also use the notion of mixing time: the minimum number of steps T such that the distribution on states after T steps of executing π is within ɛ of the stationary distribution induced by π, under some measure of distance between distributions such as the Kullback-Leibler divergence. In fact, Kearns and Singh use a notion of mixing time defined in terms of the expected return.

Definition 5. Let M be an MDP and let π be a policy that induces a well-defined stationary distribution on the states. The ɛ-return mixing time of π is the smallest T such that for all T' ≥ T and any state i,

|U_M^π(i, T') − U_M^π| ≤ ɛ.    (5)
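Definitions 3 and 4 translate directly into code. The sketch below, on a toy one-action MDP with illustrative numbers of our own choosing, computes the probability of a T-path and the two notions of return along it.

```python
import numpy as np

# Toy MDP: three states, a single action, transition tensor P[a, i, j],
# and mean payoffs R[i]. All concrete numbers are illustrative only.
N = 3
P = np.array([[[0.9, 0.1, 0.0],
               [0.0, 0.8, 0.2],
               [0.3, 0.0, 0.7]]])
R = np.array([0.0, 0.5, 1.0])
pi = np.zeros(N, dtype=int)          # the single action everywhere

def path_probability(path, P, pi):
    """Pr_M^pi[p] = product over k of P_M^{pi(i_k)}(i_k, i_{k+1})."""
    return np.prod([P[pi[i], i, j] for i, j in zip(path[:-1], path[1:])])

def undiscounted_return(path, R):
    """U_M(p) = (1/T) * sum of the payoffs of the first T states."""
    T = len(path) - 1
    return sum(R[i] for i in path[:T]) / T

def discounted_return(path, R, gamma):
    """V_M(p) = sum_k gamma^{k-1} R_{i_k}, k = 1..T."""
    return sum(gamma ** k * R[i] for k, i in enumerate(path[:-1]))
```

For the 2-path p = (1, 2, 3) (indices 0, 1, 2 here), the path probability is 0.1 × 0.2 = 0.02, the undiscounted return is (0 + 0.5)/2 = 0.25, and with γ = 0.9 the discounted return is 0 + 0.9 × 0.5 = 0.45. The expected return of equation (4) is then the path-probability-weighted average over all T-paths.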
In short, whereas the standard notion of mixing time ensures an approximate distribution on the states, the ɛ-return mixing time ensures that the expected return is close to the asymptotic return. The latter is polynomially bounded by the standard mixing time (see Lemma 1, Kearns and Singh [2002]). The analogous concept for the discounted case is described in the following lemma.

Lemma 1. (Lemma 2, Kearns and Singh [2002]) Let M be any MDP, and let π be any policy in M. If

T ≥ (1/(1 − γ)) log(R_max/(ɛ(1 − γ)))    (6)

then for any state i,

V_M^π(i, T) ≤ V_M^π(i) ≤ V_M^π(i, T) + ɛ.    (7)

We call the value of the lower bound on T given above the ɛ-horizon time for the discounted MDP M. The following definition introduces further convenient notation used throughout the paper.

Definition 6. We denote by Π_M^{T,ɛ} the class of all ergodic policies in an MDP M whose ɛ-return mixing time is at most T, and opt(Π_M^{T,ɛ}) denotes the optimal expected asymptotic undiscounted return achievable by any policy in Π_M^{T,ɛ}.

3 The E^3 Algorithm

The main theorem of Kearns and Singh [2002] states the existence of a polynomial-time algorithm that returns a near-optimal policy for any given MDP.

Theorem 1. (Main Theorem of Kearns and Singh [2002]) Let M be an MDP over N states.

(Undiscounted case) There exists an algorithm A taking inputs ɛ, δ, N, T, and opt(Π_M^{T,ɛ}), such that the total number of actions and computation time taken by A is polynomial in 1/ɛ, 1/δ, N, T, and R_max, and with probability at least 1 − δ the total actual return of A exceeds opt(Π_M^{T,ɛ}) − ɛ.

(Discounted case) There exists an algorithm A taking inputs ɛ, δ, N, and V^*(i), such that the total number of actions and computation time taken by A is polynomial in 1/ɛ, 1/δ, N, the horizon time T = 1/(1 − γ), and R_max, and with probability at least 1 − δ, A halts in a state i and outputs a policy π̂ such that V_M^{π̂}(i) ≥ V_M^*(i) − ɛ.

Kearns and Singh prove the above by describing the algorithm explicitly.
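The ɛ-horizon time of Lemma 1 is a closed-form expression, and its sufficiency can be checked against the geometric tail it truncates. A small sketch, with a helper name of our own:

```python
import math

# The eps-horizon time of Lemma 1: truncating the discounted return after
# T >= (1/(1 - gamma)) * log(R_max / (eps * (1 - gamma)))
# steps changes the value by at most eps.
def horizon_time(gamma, eps, r_max):
    return math.ceil((1.0 / (1.0 - gamma)) * math.log(r_max / (eps * (1.0 - gamma))))

# The truncated tail is bounded by the geometric series
#   sum_{k > T} gamma^{k-1} * R_max = gamma^T * R_max / (1 - gamma),
# so it suffices to verify that this tail bound falls below eps.
gamma, eps, r_max = 0.9, 0.01, 1.0
T = horizon_time(gamma, eps, r_max)
assert gamma ** T * r_max / (1.0 - gamma) <= eps
```

This is why the discounted case of Theorem 1 can treat T = 1/(1 − γ) (up to the logarithmic factor) as the effective horizon.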
The algorithm begins by performing what is called balanced wandering, described in Algorithm 1. In this process the algorithm takes arbitrary actions from newly visited states, and from already-visited states takes the least-tried actions, in order to build a balanced model of the actual world M.

Algorithm 1 Balanced Wandering
if i has never been visited before then
    Take an arbitrary action
else if i has been visited < m_known times then
    Take the action that has been tried the fewest times from i
    If there are several such actions, choose among them at random
else
    Add i to S, since i has been visited m_known times
end if

While wandering, the algorithm keeps track of the empirical transition probabilities and payoff distributions for each state. Once a state has been visited a sufficient number of times, we say that the state is known, since we have sufficiently good statistics for that state. The explore-exploit decision takes place whenever the algorithm reaches an already known state. If the optimal policy based on the current model is good enough by some measure, then the algorithm exploits; otherwise, it explores. One of the key technical lemmas (the Explore or Exploit Lemma) guarantees that either the current model supports a policy with good immediate payoff, or the algorithm can quickly explore into yet-unknown states. It is important to note that the exploration step can be executed at most N(m_known − 1) times, where m_known, the number of visits needed to make a state known, is O(((NTG_max^T)/ɛ)^4 Var_max log(1/δ)) (shown in Kearns and Singh [2002], page 16), so we are guaranteed to find a near-optimal policy within not too many steps. In the following sections, we discuss the two major lemmas behind the main theorem, and then discuss the algorithm in further detail.

3.1 The Simulation Lemma

The first of the two key technical lemmas states that if our empirical estimate M̂ of the true MDP M approximates M within some small error in terms of ɛ, then the expected return in M̂ is within ɛ of the true return. The measure of closeness of M̂ to M is given by the closeness of the transition probabilities and mean payoffs at each state. When the error is at most ±α, Kearns and Singh call M̂ an α-approximation of M.

Lemma 2. (Simulation Lemma, Kearns and Singh [2002]) Let M be an MDP over N states.
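Balanced wandering as described above can be sketched in a few lines. The class name and bookkeeping fields below are our own; m_known is the visit threshold after which a state's statistics are considered trustworthy.

```python
import random
from collections import defaultdict

# A sketch of balanced wandering (Algorithm 1), not the paper's code.
class BalancedWanderer:
    def __init__(self, actions, m_known):
        self.actions = list(actions)
        self.m_known = m_known
        self.visits = defaultdict(int)                       # visit count per state
        self.tries = defaultdict(lambda: defaultdict(int))   # tries[state][action]
        self.known = set()                                   # the set S of known states

    def choose_action(self, state):
        """One step of Algorithm 1; the caller stops calling this once a state is known."""
        if self.visits[state] == 0:
            action = random.choice(self.actions)             # never visited: act arbitrarily
        else:
            fewest = min(self.tries[state][a] for a in self.actions)
            ties = [a for a in self.actions if self.tries[state][a] == fewest]
            action = random.choice(ties)                     # least-tried action, ties at random
        self.visits[state] += 1
        self.tries[state][action] += 1
        if self.visits[state] >= self.m_known:
            self.known.add(state)                            # enough statistics: add to S
        return action
```

After m_known visits to a state, the per-action try counts differ by at most one; this is exactly the balance the empirical model-building step relies on.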
(Undiscounted case) Let M̂ be an O((ɛ/(NTG_max^T))^2)-approximation of M. Then for any policy π in Π_M^{T,ɛ/2} and for any state i,

U_M^π(i, T) − ɛ ≤ U_M̂^π(i, T) ≤ U_M^π(i, T) + ɛ.    (8)
(Discounted case) Let T ≥ (1/(1 − γ)) log(R_max/(ɛ(1 − γ))), and let M̂ be an O((ɛ/(NTG_max^T))^2)-approximation¹ of M. Then for any policy π and any state i,

V_M^π(i, T) − ɛ ≤ V_M̂^π(i, T) ≤ V_M^π(i, T) + ɛ.    (9)

Next, the authors establish a bound on the number m_known of visits to any state that guarantees the desired accuracy of the transition probability and payoff estimates. This bound is polynomial in the size of M, 1/ɛ, and log(1/δ), where δ is our chosen probability of failure.

3.2 The Explore or Exploit Lemma

The next lemma relies on the algorithm's partial model of the actual MDP M. Kearns and Singh introduce the induced MDP M_S on the set S of known states, defined in an intuitive way: the payoff of a state i in M_S is the payoff of i in M (except that the rewards have zero variance in M_S), and for every i, j ∈ S and any action, the transition probabilities are as in M. In addition, M_S has an absorbing state s_0 representing all of the yet-unknown states; this state has zero payoff, and all transitions that do not stay within S are redirected to s_0. Whereas M_S inherits the true transition probabilities and payoffs from M, the algorithm does not have access to this information. Rather, throughout balanced wandering, the algorithm builds an empirical estimate of the transition probabilities and payoffs in M_S, denoted M̂_S. The authors first establish that M̂_S probably approximately correctly simulates M_S in terms of payoffs (Lemma 6, Kearns and Singh [2002]), and make the simple yet important observation that any payoff achievable in M_S, and thus approximately achievable in M̂_S, is also achievable in M. The following lemma is a crucial step toward the main theorem. It states that at any time step at which we encounter a known state, either the current optimal policy in M̂_S yields an immediate payoff close to the optimal payoff in M, or another policy computed from M̂_S will quickly reach an unknown state.

Lemma 3.
(Explore or Exploit Lemma, Kearns and Singh [2002]) Let M be an MDP, let S be any subset of the states of M, and let M_S be the induced MDP on S². For any i ∈ S, any T, and any 1 > α > 0, either there exists a policy π in M_S such that U_{M_S}^π(i, T) ≥ U_M^*(i, T) − α (respectively, V_{M_S}^π(i, T) ≥ V_M^*(i, T) − α in the discounted case), or there exists a policy π in M_S such that the probability that executing π for T steps ends in s_0 is at least α/G_max^T.

¹ There was a missing open parenthesis in the published work. It is corrected here.
² Typo from the original paper corrected here.
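The construction of the induced MDP M_S described above is mechanical and can be sketched directly. The function and variable names below are our own; index 0 plays the role of the absorbing state s_0.

```python
import numpy as np

# Sketch of the induced MDP M_S of Section 3.2: known states keep their
# transitions among themselves, and all probability mass leaving S is
# redirected to a single zero-payoff absorbing state s_0 (index 0 here).
def induced_mdp(P, R, known):
    """P: (k, N, N) transition tensor, R: (N,) mean payoffs, known: list of states."""
    k, N, _ = P.shape
    idx = {s: n + 1 for n, s in enumerate(known)}   # known states occupy indices 1..|S|
    m = len(known) + 1
    P_S = np.zeros((k, m, m))
    P_S[:, 0, 0] = 1.0                              # s_0 is absorbing, with zero payoff
    R_S = np.zeros(m)
    for s in known:
        R_S[idx[s]] = R[s]                          # payoffs inherited from M
        for a in range(k):
            for j in range(N):
                # transitions to unknown states fold into s_0
                P_S[a, idx[s], idx.get(j, 0)] += P[a, s, j]
    return P_S, R_S

# Example: with S = {1, 2}, state 2's mass toward the unknown state 0
# collapses onto s_0.
P = np.array([[[0.5, 0.5, 0.0],
               [0.0, 0.5, 0.5],
               [0.4, 0.3, 0.3]]])
R = np.array([0.0, 0.5, 1.0])
P_S, R_S = induced_mdp(P, R, [1, 2])
assert np.allclose(P_S.sum(axis=2), 1.0)            # still a proper MDP
assert np.isclose(P_S[0, 2, 0], 0.4)
```

In the algorithm proper, this construction is applied to the empirical estimates rather than the true P and R, yielding M̂_S.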
3.3 The Explicit Explore or Exploit (E^3) Algorithm

The E^3 algorithm is summarized in Algorithm 2. Note that balanced wandering occurs only among unknown states, since we are guaranteed to have collected good enough statistics for the known states. Let us turn our attention to what happens when the algorithm encounters a known state. By M̂′_S we denote the MDP with the same transition probabilities as M̂_S but with a high payoff for exploration (R_{M̂′_S}(s_0) = R_max) and no payoff for the known states. The policies π̂ and π̂′ are obtained by value iteration [Bertsekas and Tsitsiklis, 1989] on M̂_S and M̂′_S respectively, which takes O(N²T) time. The algorithm then checks whether the return from i under π̂ is at least opt(Π_M^{T,ɛ}) − ɛ/2 (respectively, V_M^*(i) − ɛ/2 in the discounted case). If so, the algorithm executes π̂ for the next T steps, where T is the ɛ/2-return mixing time (respectively, the algorithm halts and outputs π̂). Otherwise, the algorithm attempts exploration by executing π̂′, computed from the model M̂′_S in which the unknown states carry high reward, for the next T steps in M. By the Explore or Exploit Lemma, π̂′ enters an unknown state with probability at least ɛ/(2G_max^T).

Algorithm 2 The Explicit Explore or Exploit (E^3) Algorithm
Initially, the set of known states S is empty.
while learning M do
    if the current state i ∉ S then
        Perform balanced wandering.
    else
        Compute optimal policies π̂ and π̂′ in M̂_S and M̂′_S.
        if π̂ obtains a near-optimal payoff then
            Attempt exploitation using π̂.
        else
            Attempt exploration using π̂′.
        end if
        If at any point during these attempts we visit a state i ∉ S, perform balanced wandering.
    end if
end while

Three sources of failure are present: (1) the estimate M̂_S is not a good approximation of M_S; (2) the exploration attempts take too long to render a new state known; and (3), in the undiscounted case only, the exploitation attempts using π̂ repeatedly yield returns below opt(Π_M^{T,ɛ}) − ɛ/2.
For all three scenarios, Kearns and Singh show that the probability of failure can be bounded as tightly as we like in terms of δ.
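The planning step at the heart of the algorithm, value iteration on a small tabular model of the known states, can be sketched as follows. Function and variable names are our own; with the number of actions k treated as a constant, each of the T backups costs O(N²), matching the O(N²T) planning time cited above.

```python
import numpy as np

# A sketch of T-step value iteration on a tabular model, as used to compute
# the exploitation and exploration policies on the known-state models.
def value_iteration(P, R, T):
    """P: (k, N, N) transitions, R: (N,) mean payoffs, T >= 1 steps.
    Returns the (unnormalized) T-step values and a greedy policy."""
    k, N, _ = P.shape
    V = np.zeros(N)
    for _ in range(T):
        Q = R[None, :] + np.einsum('aij,j->ai', P, V)  # Q[a, i]: act, then follow V
        V = Q.max(axis=0)
    return V, Q.argmax(axis=0)

# Example: action 0 stays put, action 1 swaps the two states; only state 1
# pays. The greedy policy moves to, then stays at, the paying state.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([0.0, 1.0])
V, policy = value_iteration(P, R, T=3)
assert policy[0] == 1 and policy[1] == 0
```

Running the same routine on M̂_S yields π̂, and running it on M̂′_S, where s_0 pays R_max, yields the exploration policy π̂′.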
4 Discussion

The E^3 algorithm uses only a partial model M̂_S of the entire MDP M and yet yields a near-optimal policy. Moreover, it was the first algorithm to demonstrate finite bounds on the number of actions and the computation time required to achieve such a policy. However, E^3 fails to address the two greatest difficulties in applying reinforcement learning in the real world. As Kearns and Singh mention, bounds polynomial in the number of states may be unsatisfactory if the state space is large. We must also note that all big-O bounds are in fact also polynomial in k, the size of the action space, since the authors assumed for convenience that k is a constant. Hence, the authors' concern about E^3 being infeasible with a large state space extends to impracticality with a large action space, or indeed any large system. Some algorithms exist for large state spaces but have only been shown experimentally to work well [Jong and Stone, 2007, Doya, 2000]. One promising theoretical result by Kakade and Langford [2002] outlines an algorithm that is claimed to return a near-optimal policy efficiently, with approximation and computation parameters that do not explicitly depend on the size of the system, but I have not had time to study their work. The E^3 algorithm also assumes that the agent can observe the state of the environment, which is not always the case in the real world. Hence, it would be interesting to study reinforcement learning in the context of POMDPs (Partially Observable Markov Decision Processes), as in Kimura et al. [1997]. Kearns and Singh propose a model-free approach as possible future work. Inspired by algorithms such as E^3, R-max, and Q-learning, Strehl et al. [2006] devised a new algorithm called Delayed Q-learning, which provably outputs a probably approximately optimal policy in a model-free setting in polynomial time.

References

D. P. Bertsekas and J. N. Tsitsiklis. Parallel and distributed computation: numerical methods, volume 23.
Prentice Hall, Englewood Cliffs, NJ, 1989.

R. I. Brafman and M. Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res., 3:213-231, Mar. 2003. ISSN 1532-4435. doi: 10.1162/153244303765208377. URL http://dx.doi.org/10.1162/153244303765208377.

K. Doya. Reinforcement learning in continuous time and space. Neural Computation, 12(1):219-245, 2000.

N. K. Jong and P. Stone. Model-based function approximation in reinforcement learning. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS '07, pages 95:1-95:8, New York, NY, USA, 2007. ACM. ISBN 978-81-904262-7-5. doi: 10.1145/1329125.1329242. URL http://doi.acm.org/10.1145/1329125.1329242.

S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pages 267-274, 2002.

M. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209-232, 2002.

H. Kimura, K. Miyazaki, and S. Kobayashi. Reinforcement learning in POMDPs with function approximation. In ICML, volume 97, pages 152-160, 1997.

A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman. PAC model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 881-888. ACM, 2006.