Comparison of Information Theory Based and Standard Methods for Exploration in Reinforcement Learning


Freie Universität Berlin
Fachbereich Mathematik und Informatik

Master Thesis

Comparison of Information Theory Based and Standard Methods for Exploration in Reinforcement Learning

Michael Borst

Advisor: Prof. Dr. Marc Toussaint

Berlin,


Abstract

Exploration is a key part of reinforcement learning. In the classic setting, autonomous agents are supposed to learn a model of their environment to successfully complete a task. Recent works in the field and in related fields have suggested the use of quantities based on Shannon's information theory to enable agents to do so. The underlying concepts of exploration vary between those works. In this thesis, these different notions of exploration will be introduced and compared. Further, two algorithms based on established dynamic programming methods are introduced to maximize two information theoretic quantities: the entropy of the state distribution and predictive information, a quantity relating the past and the future of the agent. These algorithms are evaluated in two settings: planning with the true world model, and interaction with the environment without prior knowledge. Entropy maximization proved to be possible in both settings, while predictive information maximization was only successful in the first. The behavior resulting from maximizing these quantities is also analyzed.


Contents

1 Introduction
  1.1 Outline
2 Reinforcement Learning
  2.1 The Reinforcement Learning Problem
  2.2 Markov Decision Processes
    2.2.1 Environments
  2.3 Planning under Uncertainty
  2.4 An Overview of Reinforcement Learning Methods
    2.4.1 Sample complexity and PAC-MDP
    2.4.2 Temporal Difference Learning
    2.4.3 Model-Based Learning
    2.4.4 Bayesian Reinforcement Learning
3 Shannon's Information Theory
  3.1 Fundamental Quantities of Shannon's Information Theory
  3.2 Information Theoretic Properties of Stochastic Processes
4 Contrasting Different Notions of Explorative Behaviour
  4.1 Exploration in Reinforcement Learning
  4.2 Information Theoretic Measures in Reinforcement Learning
    4.2.1 Information Gain
    4.2.2 Entropy
    4.2.3 Predictive Information
  4.3 Similarities and Differences
5 Maximizing Information Theoretic Quantities
  5.1 Information Theoretic Quantities in MDPs
    5.1.1 The State Distribution
    5.1.2 State Entropy
    5.1.3 Predictive Information
  5.2 Modification of the Standard Methods
    5.2.1 Reward Functions
    5.2.2 Q-Iteration
    5.2.3 Policy Iteration
    5.2.4 Action Selection
6 Evaluation of Planning Algorithms
  6.1 Entropy Maximization
  6.2 Predictive Information Maximization
  6.3 Discussion
7 Reinforcement Learning Evaluation
  7.1 Model Accuracy
  7.2 Information Theoretic Quantities
  7.3 Discussion
8 Conclusion
  8.1 Future Research
Bibliography
Declaration of Academic Integrity

1 Introduction

The idea of intelligent machines has a long history. Alan Turing, one of the founding fathers of modern computer science, already entertained this idea in his seminal article Computing Machinery and Intelligence in 1950 [24]. In this article, he not only introduced the Turing test to assess the intelligence of a machine, he also suggested machines could learn by trial and error. This approach, termed reinforcement learning, has by now been applied to a vast array of problems, from scheduling [29] to autonomous helicopter flight [12].

In the classic reinforcement learning setting, an agent has to learn how to accomplish a task in an unknown environment by interacting with this environment, with only a reward signal as guidance. This reward signal, and thus the task, does not originate from within the agent; it is an external motivation. Everything the agent learns is dedicated to solving that task. But what if the agent does not have such a specific purpose? What if it is supposed to just familiarize itself with its environment, to learn what can be done within this environment, or to just generally behave in a certain fashion? The motivation has to be relocated; it has to become an intrinsic drive to learn or to act. Recent work in this respect has applied Shannon's information theory to create intrinsic motivation ([17], [1], [28], [16], [23], [11]). While the intentions behind these applications vary, all share the use of information theoretic quantities to derive an intrinsic motivation for the agent to behave in a certain way.

This thesis will give an overview of the information theoretic approaches to reinforcement learning, focused on those which try to induce learning or explorative behavior in the agent, elaborate on their intentions, and compare them to classic reinforcement learning. Further, it will present modifications of existing algorithms to maximize information theoretic quantities, namely entropy and predictive information, the latter inspired by [1]. It will evaluate the ability of these algorithms to do so both with a model of the environment available and without such a model.

1.1 Outline

The remainder of this work is structured as follows: Section 2 will introduce the basic reinforcement learning problem, a model to formalize it, and ways to solve it. Section 3 will familiarize the reader with Shannon's information theory and its application to stochastic processes.

Section 4 will elaborate more on the role of exploration in classic reinforcement learning, followed by a presentation of different works using information theory in the context of reinforcement learning and a comparison of both. Section 5 will then present algorithms that maximize entropy and predictive information. These algorithms will be evaluated as planning algorithms given a model of the environment in Section 6 and in an unknown environment in Section 7. The thesis will be finalized by a summary and an outlook on possible future work in Section 8.

2 Reinforcement Learning

This section will introduce the reinforcement learning problem, the Markov decision process as a formalization of this problem, basic methods of determining optimal behavior given the required knowledge, and methods that both acquire this knowledge and enable optimal behavior. The first three sections are based on the introduction to reinforcement learning by Sutton and Barto [22], and the reader is referred to this book for a more detailed introduction.

2.1 The Reinforcement Learning Problem

Reinforcement learning is a subfield of machine learning that is concerned with learning how to achieve goals through interaction. A decision maker, called agent, is set in an unfamiliar environment. The agent is able to observe the state s_t of the environment at a time t through some kind of sensor. In reaction to this state, it can interact with the environment by performing an action a_t. For this action, it receives a reward r_{t+1} and then transitions into a new state s_{t+1}. One interaction is therefore defined by the 4-tuple (s_t, a_t, s_{t+1}, r_{t+1}). (Where unambiguous, s is used for s_t, s' for s_{t+1}, a for a_t and r for r_{t+1}.) This interface for the agent-environment interaction is illustrated in Figure 2.1.

Figure 2.1: The agent-environment interface [22]: the agent observes state s_t and reacts with action a_t. This results in a reward signal r_{t+1} and a new environment state s_{t+1}.

The transitions from state to state via actions are stochastic, i.e., the successor state of an action in a state is not always the same. The agent's goal is to maximize the cumulated reward over time, or return R_T:

    R_T = r_{t+1} + r_{t+2} + r_{t+3} + ... + r_T = Σ_{i=0}^{T-t-1} r_{t+i+1}    (2.1)

where T is the final time step. If T is finite, one speaks of a finite horizon setting. For T = ∞, this definition is problematic, since the return is no longer guaranteed to be finite. Therefore, for tasks that do not have a clearly defined stopping point, discounting is used. So, in the infinite horizon setting, the discounted return is

    R = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... = Σ_{i=0}^{∞} γ^i r_{t+i+1}    (2.2)

where 0 ≤ γ ≤ 1 is a discount rate. The discount rate determines how much immediate reward is preferred over future reward; the lower the rate, the less future reward is considered.

To maximize its return, the agent needs to have knowledge about the properties of its environment. Therefore, it needs to gather information about states and actions it hasn't seen (often enough) yet by sampling them. But doing so means disregarding actions that are considered more rewarding given the current knowledge of the agent. This dilemma is known as the exploration-exploitation tradeoff and is one of the central problems in reinforcement learning. Every algorithm must present some kind of solution to this problem.

The reinforcement learning problem differs from other problems in machine learning. In contrast to supervised learning, where there is a set of pairs of input and desired output, the agent always has to produce the learning data through its own behavior. But unlike unsupervised learning, there is feedback which the agent can use to adapt its behavior to the environment. In the next section, Markov decision processes will be introduced as a formal model for reinforcement learning.

2.2 Markov Decision Processes

The most common formal model for reinforcement learning is the Markov decision process (MDP). An MDP is a discrete-time stochastic control process and is defined by the 4-tuple (S, A, p, R) with the set of states S, the set of actions A, the state transition distribution p(s' | s, a), s, s' ∈ S, a ∈ A, and the reward function or signal R : S × A × S → ℝ.

The reward function is bounded. If both the set of states and the set of actions are finite, the MDP is called a finite MDP. From here on in, every MDP is considered a finite MDP. Furthermore, the state transition probability distribution is considered to be stationary, that is, conditionally independent of time.

An important property of MDPs is the Markov property.

Definition 2.1 A stochastic process possesses the Markov property if

    p(X_{n+1} = x_{n+1} | X_n = x_n, X_{n-1} = x_{n-1}, ..., X_0 = x_0) = p(X_{n+1} = x_{n+1} | X_n = x_n)    (2.3)

In other words, the future of the MDP depends solely on the present and not on the past. Therefore, the agent only has to take the current state into account when selecting an action. What action a the agent selects in reaction to a state s depends on its policy π. Policies can be either deterministic or stochastic. In this work, deterministic policies will be denoted as π(s) and stochastic policies as π(a | s). Policies are stationary.

2.2.1 Environments

There are three environments that will be used in this work. They will now be presented for later use.

The first environment will be referred to as dense world. This environment has the same number of actions and states. The actions have the following transition probabilities:

    p(s_k | s_i, a_j) = p                 if j = k
                        (1 - p)/(|S| - 1) else

for all i, j, k ∈ [1, |S|], where p is usually significantly bigger than 0.5, e.g. p = 0.8, so that every action is biased towards one successor state.

The second environment consists of an arbitrary number of states and two actions and is referred to as circle world. The circle world has the following transition probabilities:

    p(s_j | s_i, a_1) = p       if j = i + 1, or i = |S| and j = 1
                        1 - p   if j = i
                        0       else

    p(s_j | s_i, a_2) = p       if j = i - 1, or i = 1 and j = |S|
                        1 - p   if j = i
                        0       else

for all i, j ∈ [1, |S|], where p is again significantly bigger than 0.5. This environment is called circle world because if the agent always chooses action a_1 or always chooses a_2, it will have a trajectory resembling a circle.

The third environment is referred to as grid world. It is a two-dimensional environment that consists of an arbitrary number of states and four actions. The states have coordinates (x, y). The four actions can be interpreted as east (a_e), west (a_w), north (a_n) and south (a_s). As one would expect, performing one of these actions is biased to lead to a one-step increase in x, decrease in x, increase in y or decrease in y, respectively, with probability p. The probability 1 - p is equally distributed among the states in the other directions, if they exist: for a state that has neighboring states in the three other directions, the probability of ending up in each of these states instead of the state towards which the action is biased is (1 - p)/3, for two other directions it is (1 - p)/2, and for one it is 1 - p. If there is no state in the direction the action is biased towards, the agent simply stays in its current state with probability p, and probability 1 - p is equally divided among the available neighboring states. The parameter p is usually global, though there will be a variant of this environment that uses a random probability p for each state-action pair and has selected state-action pairs not biased towards one successor state.
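To make the environment definitions concrete, the following is a minimal sketch (not from the thesis) of how the circle world transition probabilities above could be encoded as a transition tensor P[s, a, s'] in Python with NumPy; the function name and array layout are illustrative choices.

    import numpy as np

    def circle_world(n_states: int, p: float = 0.8) -> np.ndarray:
        """Build the circle world transition tensor P[s, a, s'].

        Action 0 moves one step "clockwise" (i -> i+1, wrapping around),
        action 1 moves one step "counter-clockwise" (i -> i-1, wrapping around).
        With probability p the move succeeds, with probability 1-p the agent stays.
        """
        P = np.zeros((n_states, 2, n_states))
        for i in range(n_states):
            P[i, 0, (i + 1) % n_states] = p      # intended successor of a_1
            P[i, 0, i] = 1.0 - p                 # action fails, agent stays
            P[i, 1, (i - 1) % n_states] = p      # intended successor of a_2
            P[i, 1, i] = 1.0 - p
        return P

    P = circle_world(5)
    assert np.allclose(P.sum(axis=2), 1.0)  # every (s, a) row is a distribution

The grid world and dense world can be built in the same style, only with different successor-state rules.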

2.3 Planning under Uncertainty

Having defined a formal model of the environment, the following section will elaborate more on finding a policy that maximizes return when given an MDP, or solving the MDP. Since the outcome of actions in the environment is stochastic, this is referred to as planning under uncertainty. The methods presented here are cases of dynamic programming, a concept invented by Richard Bellman [2].

First, a more precise formulation of the return in an MDP is needed. When following a fixed policy π and starting in a state s, the expected discounted return or (state) value V^π(s) of a state s is

    V^π(s) = E_π{ r_1 + γ r_2 + γ^2 r_3 + ... | s_0 = s }.    (2.4)

The value enables the agent to assess the desirability of being in a certain state. The value can be reformulated through its recursive property so that it becomes more explicit:

    V^π(s) = E_π{ r_1 | s_0 = s } + γ E_π{ r_2 + γ r_3 + ... | s_0 = s }
           = Σ_{s'} p(s' | s, π(s)) [ R(s, π(s), s') + γ E_π{ r_2 + γ r_3 + ... | s_1 = s' } ]
           = Σ_{s'} p(s' | s, π(s)) [ R(s, π(s), s') + γ V^π(s') ]    (2.5)

or, in case of a stochastic policy, as

    V^π(s) = Σ_a π(a | s) Σ_{s'} p(s' | s, a) [ R(s, a, s') + γ V^π(s') ].    (2.6)

Alternatively, when the agent performs action a in state s and follows a fixed policy π thereafter, the state-action value Q^π(s, a) is given by

    Q^π(s, a) = Σ_{s'} p(s' | s, a) [ R(s, a, s') + γ Q^π(s', π(s')) ]    (2.7)

or, in case of a stochastic policy, as

    Q^π(s, a) = Σ_{s'} p(s' | s, a) [ R(s, a, s') + γ Σ_{a'} π(a' | s') Q^π(s', a') ].    (2.8)

Both values relate states to the expected discounted return by weighted propagation of future rewards through the possible sequences (s_0, a_0, r_1, s_1, a_1, r_2, ...). State value and state-action value are interchangeable, since V^π(s) = Q^π(s, π(s)), or V^π(s) = Σ_a π(a | s) Q^π(s, a) for stochastic policies.

Equations 2.5 and 2.7 are the Bellman equations for the state and state-action value function. Iterating either V^π or Q^π leads to convergence to the value or state-action value function for the corresponding policy π. This application of dynamic programming is called policy evaluation.

Based on the value function, optimality can be easily defined. A policy is optimal if it maximizes the state value in every state:

    ∀ s ∈ S : V^{π*}(s) = V*(s)   where   V*(s) = max_π V^π(s)    (2.9)
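As an illustration of policy evaluation, here is a minimal sketch of my own (not code from the thesis) that iterates the Bellman equation (2.6) for a fixed stochastic policy until the value estimate stops changing; P and R are assumed to be arrays of shape [S, A, S] as in the environment sketch above, and pi an array of shape [S, A].

    import numpy as np

    def policy_evaluation(P, R, pi, gamma=0.95, tol=1e-8):
        """Iterate V(s) <- sum_a pi(a|s) sum_s' P[s,a,s'] (R[s,a,s'] + gamma V(s'))."""
        n_states = P.shape[0]
        V = np.zeros(n_states)
        while True:
            # expected immediate reward plus discounted future value, per (s, a)
            Q = np.einsum('sat,sat->sa', P, R) + gamma * P @ V   # P @ V has shape [S, A]
            V_new = np.einsum('sa,sa->s', pi, Q)                  # average over pi(a|s)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new
            V = V_new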

For every MDP there exists at least one deterministic policy that is optimal with regards to the value. Bellman's principle of optimality makes it possible to find this optimal policy. This principle states that, for any initial state, a policy that selects the action that maximizes the value of the state and from there on out is equal to the optimal policy is optimal. The Bellman optimality equation for the value is

    V*(s) = max_a Σ_{s'} p(s' | s, a) ( R(s, a, s') + γ V*(s') )    (2.10)

with the corresponding optimal policy

    π*(s) = argmax_a Σ_{s'} p(s' | s, a) ( R(s, a, s') + γ V*(s') ).    (2.11)

Iterating the equation

    V_{k+1}(s) = max_a Σ_{s'} p(s' | s, a) ( R(s, a, s') + γ V_k(s') )    (2.12)

for all states is called value iteration and lets V_k converge to V*, which implicitly contains the optimal policy via Equation 2.11. The Bellman optimality equation for the state-action value function is

    Q*(s, a) = Σ_{s'} p(s' | s, a) [ R(s, a, s') + γ max_{a'} Q*(s', a') ]    (2.13)

with the corresponding optimal policy

    π*(s) = argmax_a Q*(s, a).    (2.14)

Repeated application of

    Q_{k+1}(s, a) = Σ_{s'} p(s' | s, a) [ R(s, a, s') + γ max_{a'} Q_k(s', a') ]    (2.15)

for all states and actions is called Q-iteration and converges to Q*(s, a). Q-iteration contains the optimal policy explicitly, since the agent only has to choose the action with the highest state-action value.
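The following is a minimal sketch (an assumption of mine, not code from the thesis) of Q-iteration as in Equation 2.15, reusing the P and R tensors from the previous sketches; the greedy policy of Equation 2.14 is then read off with an argmax.

    import numpy as np

    def q_iteration(P, R, gamma=0.95, tol=1e-8):
        """Iterate Q(s,a) <- sum_s' P[s,a,s'] (R[s,a,s'] + gamma max_a' Q(s',a'))."""
        n_states, n_actions, _ = P.shape
        Q = np.zeros((n_states, n_actions))
        while True:
            Q_new = np.einsum('sat,sat->sa', P, R) + gamma * P @ Q.max(axis=1)
            if np.max(np.abs(Q_new - Q)) < tol:
                return Q_new
            Q = Q_new

    # The optimal deterministic policy (Equation 2.14) is simply:
    # pi_star = q_iteration(P, R).argmax(axis=1)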

There exists another method to solve an MDP. It consists of two steps:

1. Evaluate policy π to obtain Q^π(s, a).
2. Select a new policy π(s) = argmax_a Q^π(s, a).

These two steps are applied in turns until the policy stops changing. This method is called policy iteration.

These planning algorithms enable the agent to perform exploitation easily if the true world model is known. But the agent doesn't actually have this model. The algorithms presented in the following section will show ways of dealing with this situation, or how to learn.

2.4 An Overview of Reinforcement Learning Methods

As explained earlier, planning how to maximize return when the dynamics of the environment are known is not the reinforcement learning problem. Instead, the agent has no knowledge about its environment at the beginning of its task and has to gather it autonomously. The agent has to perform exploration. The difficulty is that the two things the agent has to do - exploration and exploitation - are somewhat opposed: at any given time, the agent can either choose to try to increase its knowledge or to use it to obtain reward. Some way of dealing with this so-called exploration-exploitation tradeoff is needed.

In reinforcement learning, there are three basic methods of learning. Model-free learning, where some value function is approximated, is represented here through temporal difference learning and introduced first. Second, there are model-based algorithms which try to efficiently learn a model of the system and then solve this model through dynamic programming. These are represented here through the RMAX approach. Finally, another model-based solution is introduced: Bayesian reinforcement learning, in which a Bayesian optimal policy is approximated. The last learning method is policy search, where a policy is directly learned from the data, e.g. through policy gradients. This group of methods will not be introduced here. Before these methods are introduced, several measures of complexity for reinforcement learning algorithms will be defined, along with a framework for assessing efficiency.

2.4.1 Sample complexity and PAC-MDP

There are three relevant measures of complexity in reinforcement learning. Computational complexity is the amount of time the agent needs to perform the required computations for each time step. Space complexity is the amount of memory the agent needs to store the required information for its computations; for example, an agent would need at least |S| · |A| entries to store the state-action value function. The third quantity is more complex.

Definition 2.2 (Sample complexity) Let c = (s_0, a_0, r_1, s_1, a_1, ..., a_{t-1}, r_t, s_t) be a random path generated by executing an algorithm A in an MDP M. For any fixed ε > 0, the sample complexity of exploration of A is the number of timesteps t such that the policy at time t, A_t, satisfies V^{A_t}(s_t) < V*(s_t) - ε.

Sample complexity [8] is the number of timesteps for which the algorithm's, or the agent's, return is more than ε worse than the optimal return. Put differently, it is the number of samples needed for the agent to perform sufficiently well. Sample complexity is rather important, since in real reinforcement learning settings, sampling can take a lot of time. Based on these three complexity measures, Strehl et al. [18] introduced the PAC-MDP concept.

Definition 2.3 (PAC-MDP) An algorithm A is said to be an efficient PAC-MDP (Probably Approximately Correct in Markov Decision Processes) algorithm if, for any ε > 0 and 0 < δ < 1, the per-timestep computational complexity, space complexity, and the sample complexity of A are less than some polynomial in the relevant quantities (|S|, |A|, 1/ε, 1/δ, 1/(1 - γ)), with probability at least 1 - δ. It is simply PAC-MDP if the definition is relaxed to have no computational complexity requirement.

An efficient PAC-MDP algorithm thus performs sub-optimally only in a number of time steps that is polynomial in the mentioned quantities. PAC-MDP is an important formal framework for efficiency in reinforcement learning. It is also used to derive upper and lower bounds for these algorithms. There are model-free as well as model-based efficient PAC-MDP algorithms.

2.4.2 Temporal Difference Learning

Temporal difference learning was introduced by Sutton in 1988 [21]. In this form of learning, no model of the dynamics of the environment is maintained, which is why these methods are called model-free. Instead, a value function is learned. Specifically, temporal difference methods iteratively improve an approximation of the value function using a previous estimate of it, a concept called bootstrapping.

In the simplest form, TD(0), the approximation is updated according to the update step

    V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) - V(s_t) ].    (2.16)

This can be done online after every experience and converges to an optimal estimation of the value function in the sense that it would be correct for the maximum-likelihood model of the corresponding Markov process. The value function approximated is, of course, the value function of the policy that the agent follows. To get to an algorithm that learns the optimal value function V* and is therefore able to successfully exploit, some modifications have to be made. Two such modifications shall be discussed here: SARSA and Q-learning.

Algorithm 2.1: SARSA
  Initialize Q(s, a) arbitrarily
  Draw start state s_0
  Choose action a_0 ←(ε-greedy) argmax_a Q(s_0, a)
  for t = 0, 1, 2, 3, ... do
      Execute a_t, observe r_{t+1}, s_{t+1}
      Choose action a_{t+1} ←(ε-greedy) argmax_a Q(s_{t+1}, a)
      Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) ]
  end

SARSA [14] is an on-policy temporal difference learning algorithm. On-policy means that, similar to policy iteration, it follows a policy π and approximates the corresponding state-action value function Q^π(s, a) while simultaneously optimizing the policy with respect to Q^π. Algorithm 2.1 shows a concrete version of SARSA with ε-greedy action selection. A 5-tuple (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}) - hence the name SARSA - is generated by the agent by selecting action a_t in the current state s_t, obtaining some reward r_{t+1}, observing the follow-up state s_{t+1} and choosing another action a_{t+1}. Actions are selected with a method that incorporates some exploration mechanism, explained in detail later. The sample is then used to update the estimate of the state-action value function according to

    Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) ].    (2.17)

Note that the tuples used in adjacent updating steps overlap, meaning that the first action for the next step is already selected in the current step, before the state-action value function is updated. This is why SARSA is an on-policy algorithm.
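As a concrete illustration, here is a minimal Python sketch of Algorithm 2.1 (my own rendering, not code from the thesis); the `env.reset`/`env.step` interface returning the next state and reward is an assumption, and no transition model is ever used.

    import numpy as np

    def sarsa(env, n_states, n_actions, steps=10000,
              alpha=0.1, gamma=0.95, epsilon=0.1, rng=np.random.default_rng(0)):
        """Tabular SARSA with epsilon-greedy action selection (Algorithm 2.1)."""
        Q = np.zeros((n_states, n_actions))

        def eps_greedy(s):
            if rng.random() < epsilon:
                return int(rng.integers(n_actions))   # explore
            return int(np.argmax(Q[s]))               # exploit

        s = env.reset()
        a = eps_greedy(s)
        for _ in range(steps):
            s_next, r = env.step(a)                   # execute a_t, observe r_{t+1}, s_{t+1}
            a_next = eps_greedy(s_next)               # on-policy: next action chosen now
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
            s, a = s_next, a_next
        return Q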

Sutton and Barto mention in their book that convergence to an optimal function is ensured if all state-action pairs are sampled an infinite number of times and the policy converges to a greedy policy; this will be referred to later.

Q-learning, on the other hand, is an off-policy temporal difference learning algorithm introduced by Watkins in 1989 [25]. It directly approximates the optimal state-action value function Q*(s, a). A concrete version is formulated in Algorithm 2.2, again with ε-greedy action selection.

Algorithm 2.2: Q-Learning
  Initialize Q(s, a) arbitrarily
  Draw start state s_0
  for t = 0, 1, 2, 3, ... do
      Choose action a_t ←(ε-greedy) argmax_a Q(s_t, a)
      Execute a_t, observe r_{t+1}, s_{t+1}
      Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t) ]
  end

The difference in the value update equation

    Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t) ]    (2.18)

in comparison to SARSA is that the follow-up action a_{t+1} = argmax_a Q(s_{t+1}, a) is not chosen according to a policy but greedily, to maximize the future value; it is chosen off-policy. In this regard, it is closer to value iteration and follows the notion of Bellman's principle of optimality - if an action maximizing the value in the current state is chosen and the agent follows an optimal policy from there on out, the agent maximizes its return. In 1992, convergence to the optimal value function was proven [26].

Both algorithms address the exploration-exploitation tradeoff through the action selection mechanism they use. In the algorithm instances presented here, ε-greedy action selection is used, where with probability 1 - ε the action a = argmax_a Q(s, a) that maximizes the state-action value function is chosen, and a random other action with probability ε, with 0 ≤ ε ≤ 1. This ensures exploration in a very simple way, and since every action is sampled infinitely often, it helps to prove convergence to the optimal value function. Yet, of course, the agent never maximizes its return, since it does not always take the optimal action. The parameter ε determines how strongly exploration is valued over exploitation.

For example, if ε = 0.1, the algorithm will converge faster than if ε = 0.01, but after some time, the return will be higher for the latter, since the maximizing action is chosen more frequently. For ε = 1, the agent chooses actions randomly, while for ε = 0, the agent never takes a non-optimal action; it acts greedily all the time. Furthermore, for SARSA, the policy has to converge to a greedy policy to ensure convergence to the optimal policy, so ε has to decrease over time, e.g. ε = 1/t. The problem with this modification is that some knowledge is needed in advance to determine how fast ε should decrease. For a very large state space S or action space A, it should obviously decrease slower than if there were only two states and two actions, for example.

Another method is softmax or Boltzmann action selection. The action a for a state s with corresponding state-action value function Q(s, a) is selected according to a Boltzmann distribution, and thus

    π(a | s) = exp(Q(s, a)/τ) / Σ_{a'} exp(Q(s, a')/τ)    (2.19)

with temperature τ. For τ → ∞, softmax action selection is equivalent to random or uniform action selection; for τ → 0, it is equivalent to greedy action selection. This selection method does not randomly select one non-optimal action with a certain probability, but ranks the actions according to their respective value. Again, the parameter τ can be chosen as a function of time. According to Sutton and Barto, there are no comparative studies about which type of action selection to prefer with regard to performance, but ε-greedy is more common because of its more intuitive parameter setting.

While the computational and space complexity of these model-free approaches is very low compared to the algorithms introduced below, they have a high sample complexity. Furthermore, an algorithm that uses ε-greedy exploration can never be an efficient PAC-MDP algorithm, since its sample complexity is exponential in the number of states [27]. Efficient PAC-MDP model-free algorithms exist, though, for example delayed Q-learning [19].

A general problem of model-free approaches is that they learn the value function for one task and one task only. They cannot apply the knowledge gained from one task to an environment with the same dynamics but a different reward function. Of course they adapt, but they would have to learn a whole new value function instead of just computing it from the known model, as model-based algorithms could. Furthermore, one could argue that they gain no actual knowledge of the world, just about the goodness of actions in states given the task. The value function contains no information about where an action leads the agent or with what probability.
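The two action selection rules discussed above can be written down in a few lines; this is a minimal sketch of my own (not from the thesis), operating on one row Q[s] of a tabular state-action value function.

    import numpy as np

    rng = np.random.default_rng(0)

    def epsilon_greedy(q_s: np.ndarray, epsilon: float) -> int:
        """With probability epsilon pick a uniformly random action, else the greedy one."""
        if rng.random() < epsilon:
            return int(rng.integers(len(q_s)))
        return int(np.argmax(q_s))

    def boltzmann(q_s: np.ndarray, tau: float) -> int:
        """Sample an action from the Boltzmann distribution of Equation 2.19."""
        prefs = q_s / tau
        prefs -= prefs.max()                      # subtract the max for numerical stability
        probs = np.exp(prefs) / np.exp(prefs).sum()
        return int(rng.choice(len(q_s), p=probs))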

2.4.3 Model-Based Learning

In this section, the model-based algorithm RMAX will be introduced as a model-based efficient PAC-MDP algorithm. RMAX was first published in 2001 [4], but the PAC-MDP modification shown in Algorithm 2.3 is from Strehl et al. [18]. RMAX is a manifestation of the principle of optimism in the face of uncertainty. Put very simply, every state-action pair that has not been sampled enough times is considered good and therefore rewarded with maximum reward, hence the name RMAX. This is how RMAX encourages exploration. From the samples, a model is built and used to plan in order to find an optimal policy. To understand why such a simple principle is considered a good algorithm, a closer look is needed.

An RMAX agent starts with an optimistic initial state-action value function U(s, a) that is constant for all state-action pairs and guaranteed to be an upper bound of the true value function. The agent's sample counters n(s, a, s'), n(s, a) = Σ_{s'} n(s, a, s') and r(s, a, s') (the sum of rewards obtained for performing a in s and reaching s') are set to zero. It then starts interacting with the environment and updates its counters. As soon as it has seen any state-action pair (s, a) m times, it uses the counters to estimate the transition probabilities

    p̂(s' | s, a) = n(s, a, s') / n(s, a)   for all s' ∈ S

and the reward function

    R̂(s, a, s') = r(s, a, s') / n(s, a, s')   for all s' ∈ S

with maximum likelihood. The partial model made up from these estimations is then used to plan, in this case via value iteration. Note that the state-action value is only updated for known pairs and only if a new pair becomes known, yet the value of unknown pairs can be part of the value update as future value. This is important, since no exploration would happen otherwise. As mentioned before, RMAX is an efficient PAC-MDP algorithm if the right values for the parameters c and m are chosen. For a full analysis, the reader is referred to the original papers.

Algorithm 2.3: R-MAX
  foreach (s, a) ∈ S × A do
      Q(s, a) ← U(s, a)
      n(s, a) ← 0
      foreach s' ∈ S do
          n(s, a, s') ← 0
          r(s, a, s') ← 0
      end
  end
  Draw start state s_0
  for t = 0, 1, 2, 3, ... do
      Choose action a_t ←(ε-greedy) argmax_a Q(s_t, a)
      Execute a_t, observe r_{t+1}, s_{t+1}
      if n(s_t, a_t) < m then
          n(s_t, a_t) ← n(s_t, a_t) + 1
          r(s_t, a_t, s_{t+1}) ← r(s_t, a_t, s_{t+1}) + r_{t+1}
          n(s_t, a_t, s_{t+1}) ← n(s_t, a_t, s_{t+1}) + 1
          if n(s_t, a_t) = m then
              for i = 1, 2, 3, ..., c do
                  foreach (s̄, ā) with n(s̄, ā) ≥ m do
                      Q(s̄, ā) ← Σ_{s'} p̂(s' | s̄, ā) [ R̂(s̄, ā, s') + γ max_{a'} Q(s', a') ]
                  end
              end
          end
      end
  end
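To make the model-estimation step concrete, here is a small Python sketch of my own (not from the thesis) that turns the RMAX counters into the maximum-likelihood estimates p̂ and R̂ used in the planning step; the array names and shapes are illustrative.

    import numpy as np

    def estimate_model(n_sas, r_sas, m):
        """Maximum-likelihood model from RMAX counters.

        n_sas[s, a, s'] counts observed transitions, r_sas[s, a, s'] accumulates rewards.
        Only state-action pairs with n(s, a) >= m are considered "known"; for the
        others the estimates stay at zero and the optimistic value U(s, a) is kept.
        """
        n_sa = n_sas.sum(axis=2)                              # n(s, a)
        known = n_sa >= m                                     # boolean mask over (s, a)
        p_hat = np.zeros_like(n_sas, dtype=float)
        r_hat = np.zeros_like(r_sas, dtype=float)
        p_hat[known] = n_sas[known] / n_sa[known][:, None]    # n(s,a,s') / n(s,a)
        seen = np.nonzero(n_sas)
        r_hat[seen] = r_sas[seen] / n_sas[seen]               # r(s,a,s') / n(s,a,s')
        return p_hat, r_hat, known

The planning step on the known pairs then reuses Q-iteration from Section 2.3 on (p_hat, r_hat).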

E³ (Explicit Explore or Exploit), another model-based efficient PAC-MDP algorithm that uses the optimism in the face of uncertainty principle, was published by Kearns and Singh in 2002 [9]. An E³ agent maintains two models. One is the maximum likelihood estimate M_known including all known states. The other one, M_unknown, consists of all the known states with the same dynamics but zero reward for known states, and a fictitious state to which all unknown transitions lead and which has maximum reward. When in a known state, both MDPs are solved. If the value of the policy resulting from planning in M_known is high enough, the agent follows this policy; it exploits. Else, it follows the policy derived from solving M_unknown, resulting in planned exploration. If the state the agent is in is unknown, it performs the action it has performed the fewest times.

The basic concepts to take away from this section are optimism in the face of uncertainty and PAC-MDP efficiency. The agent believes that actions it doesn't know are actions that will lead to reward, and under this assumption it is able to perform approximately optimally with a certain probability of error in polynomial time with regards to the required proximity to optimality, the accepted probability of error, and the parameters of the model of the environment. This is important knowledge, although the number of samples needed to achieve sufficient proximity to optimal behavior with a satisfactory probability is very high.

2.4.4 Bayesian Reinforcement Learning

Bayesian reinforcement learning is a solution to the exploration-exploitation tradeoff that is different from the ones introduced above. In the Bayesian reinforcement learning setting, the uncertainty about the model is explicitly modeled - the agent maintains a belief b over the model. This belief is incorporated into the value function, which leads to the Bellman equation for the Bayesian state value

    V^π(b, s) = Σ_{b', s'} p(b', s' | s, a, b) [ R(s, a, s') + γ V^π(b', s') ]    (2.20)

where a = π(b, s). In a discrete environment, the belief can be easily represented through a set of Dirichlet distributions

    b = {α(s, a, s')},   p(s' | b, s, a) = α(s, a, s') / α_0(s, a)

where α(s, a, s') is simply a counter for the number of samples (s_t = s, a_t = a, s_{t+1} = s') and α_0(s, a) = Σ_{s'} α(s, a, s'). The counters are initialized so that they represent a prior over the model. For every experience, the agent then increments the corresponding counter and thus obtains a new belief over the model. Under these assumptions, Equation 2.20 can be simplified to

    V^π(b, s) = Σ_{s'} p(s' | b, s, a) [ R(s, a, s') + γ V^π(b', s') ]    (2.21)

because the new belief b' follows deterministically from the current one and the experience, for the given belief update rule. Bellman's optimality equation for the Bayesian value follows from Equation 2.21 by selecting the actions maximizing the value and is

    V*(b, s) = max_a { Σ_{s'} p(s' | b, s, a) [ R(s, a, s') + γ V*(b', s') ] }    (2.22)

with the Bayesian optimal policy π* = argmax_π V^π(b, s). Using the Bayesian value function to guide the agent leads to optimal behavior with regards to the prior over the model. So, rather than ensuring exploration through action selection or through rewards given for unknown state-action pairs, Bayesian reinforcement learning explicitly includes the agent's uncertainty over its model directly in the formulation of the expected return. This inclusion naturally induces exploration. If a new experience for a tuple (s, a) might lead to a significantly different model with a higher expected return, that action will be selected.

The problem with the Bayesian approach is that, in general, it is not tractable. There are various methods to approximate the optimal Bayesian policy or value function ([7], [6], [20], [13]); the approach presented here is chosen because it provides formal guarantees similar to PAC-MDP. The Bayesian Exploration Bonus (BEB) algorithm introduced by Kolter and Ng [10] defines the optimal value Ṽ*_H(b, s) over the next H time steps as

    Ṽ*_H(b, s) = max_a { R(s, a) + β / (1 + α_0(s, a)) + Σ_{s'} p(s' | b, s, a) Ṽ*_{H-1}(b, s') }    (2.23)

where β / (1 + α_0(s, a)) is the Bayesian exploration bonus. The parameter β is of importance for the bounds presented later. Since this equation does not use the updated belief b', standard dynamic programming can be applied to solve it. The less a state-action pair has been sampled, the higher the bonus. It is assumed that the reward function is known in advance, yet this does not affect generality, since every MDP with an unknown bounded reward function can be remodeled into an MDP with known reward by adding states to it. Kolter and Ng use the finite horizon case because their theorems build on the parameter H. The extension to the infinite horizon setting is an open question, but nevertheless this approach introduces an interesting solution to the exploration-exploitation tradeoff.
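As a small illustration of the BEB idea, the following sketch (mine, not from the thesis or from Kolter and Ng) performs the finite-horizon backups of Equation 2.23 from Dirichlet counts, keeping the belief fixed during planning; the known reward R_sa, the count array alpha, and the optional discount are assumptions of the sketch, and H is assumed to be at least 1.

    import numpy as np

    def beb_backup(alpha, R_sa, H, beta, gamma=1.0):
        """Finite-horizon dynamic programming for Eq. 2.23 (H >= 1).

        alpha[s, a, s'] are Dirichlet counts (the belief b), R_sa[s, a] is the known
        mean reward. Holding the belief fixed is what makes standard dynamic
        programming applicable here.
        """
        alpha0 = alpha.sum(axis=2)                       # alpha_0(s, a)
        p = alpha / alpha0[:, :, None]                   # p(s' | b, s, a)
        bonus = beta / (1.0 + alpha0)                    # Bayesian exploration bonus
        V = np.zeros(alpha.shape[0])                     # V_0 = 0
        for _ in range(H):
            Q = R_sa + bonus + gamma * (p @ V)           # backup over one more step
            V = Q.max(axis=1)
        return Q, V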

Using this approximation, Kolter and Ng provide the following bound:

Theorem 2.1 Let A_t denote the policy followed by the BEB algorithm (with β = 2H^2) at time t, and let s_t and b_t be the corresponding state and belief. Also suppose we stop updating the belief for a state-action pair when α_0(s, a) > 4H^3/ε. Then with probability at least 1 - δ,

    V^{A_t}_H(b_t, s_t) ≥ V*_H(b_t, s_t) - ε

for all but

    m = O( (|S| |A| H^6 / ε^2) log(|S| |A| / δ) )

time steps.

So, similar to PAC-MDP, a BEB agent is guaranteed to act sub-optimally only for a polynomial number of time steps. In fact, the bound is tighter than the PAC-MDP sample complexity

    m = Õ( |S|^2 |A| H^6 / ε^3 ).

These bounds are with regard to closeness to two different optimal value functions, though: the optimal Bayesian value function and the optimal value function for some given model. Bayes optimality requires less exploration because, for a sufficiently certain transition probability, the expected return does not change significantly when the model is updated. This intuition is formalized by the following theorem:

Theorem 2.2 Let A_t denote the policy followed by an algorithm using any exploration bonus that is upper bounded by β / n(s, a)^p for some constant β and p > 1/2. Then there exists some MDP M and ε_0(β, p) such that, with probability greater than δ_0 = 0.15,

    V^{A_t}_H(s_t) < V*_H(s_t) - ε_0

will hold for an unbounded number of time steps.

In other words, any algorithm with an exploration bonus that decays faster than 1/√n, such as BEB, cannot be PAC-MDP and may not find an optimal policy with regards to the state value.

Bayesian reinforcement learning in general introduces the uncertainty over the model (or over the models in other Bayesian approaches) explicitly into the prediction of return. The BEB algorithm introduced here has a lower sample complexity than the most efficient PAC-MDP algorithm.

This stems from the smaller amount of exploration needed to be close to Bayes optimality. It remains to be seen if it can be applied to the infinite horizon case.

3 Shannon's Information Theory

When Shannon defined entropy in 1948, it was part of his attempt to establish A Mathematical Theory of Communication [15]. This theory was supposed to find the boundaries of information compression and of the transmission of such information. Therefore, he was looking for a measure of the rate at which an information source produces information. More generally, the quantity was supposed to describe the uncertainty over the occurrence of one out of several possible events, or the inherent complexity of the process underlying these occurrences. It turned out that Shannon's work was the key concept for a theory that has far more possible fields of application than just communication theory - information theory. Cover and Thomas [5] give an extensive overview of the applications of Shannon's information theory, which range from the original field of communication theory to computer science (where entropy is approximately equal to Kolmogorov complexity, the minimal description length of a data sequence), statistics and economics. This work draws heavily on Cover's work, and the reader is referred to it for proofs and deeper insights into the field of information theory.

In the remainder of this section, the fundamental quantities of Shannon's information theory and their properties will be introduced, followed by an analysis of Markov processes from the perspective of this theory for use in later sections of this work. From here on out, the term information theory will refer to the theory built around Shannon's entropy concept and its properties.

3.1 Fundamental Quantities of Shannon's Information Theory

At the core of information theory lies the entropy H(X) of a discrete random variable X with probability distribution p(x).

Definition 3.1 The entropy H(X) of a discrete random variable X is defined by

    H(X) = - Σ_x p(x) log p(x).    (3.1)

It is common to use the convention 0 log 0 = 0. Since entropy was first introduced in communication theory, it is usually expressed in bits and the logarithm is to the base of 2.

It can be interpreted as the average number of bits needed to describe the random variable, a measure of its uncertainty, or the expected information gained by knowing its value. The entropy of a random variable and its distribution are the same, H(X) = H(p(x)).

Figure 3.1: The entropy H(X) of the Bernoulli distribution as a function of p. It is 0 for p = 1 or p = 0 and maximal for p = 0.5.

Example 3.1 Consider a random variable X with a Bernoulli distribution, that is,

    X = 1 with probability p
        0 with probability 1 - p

The plot of H(X) is shown in Figure 3.1. For p = 0 or 1, the entropy is 0 - the value is known in advance, there is no uncertainty. For p = 1 - p = 0.5, the entropy is maximal - the value can only be guessed, there is nothing but uncertainty.

As can be expected from the various interpretations, the entropy is never negative - no information can be lost through knowing the outcome of an event.

Lemma 3.1 H(X) ≥ 0.

This follows easily from 0 ≤ p(x) ≤ 1 ⇒ -log p(x) ≥ 0 and the fact that a weighted sum of non-negative values will always be non-negative. The entropy is 0 when there is no uncertainty, i.e. ∃ x̄ ∈ X : p(X = x̄) = 1. Since p(x) is a probability distribution, it follows that the probability of all other values is 0, and the entropy is H(X) = -1 · log 1 = 0.
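Example 3.1 can be checked numerically with a few lines of Python; this small sketch (mine, not from the thesis) evaluates the binary entropy over a grid of p values.

    import numpy as np

    def binary_entropy(p: np.ndarray) -> np.ndarray:
        """H(X) in bits for a Bernoulli(p) variable, using the convention 0 log 0 = 0."""
        q = np.stack([p, 1.0 - p])
        with np.errstate(divide='ignore', invalid='ignore'):
            terms = np.where(q > 0, q * np.log2(q), 0.0)
        return -terms.sum(axis=0)

    ps = np.linspace(0.0, 1.0, 101)
    H = binary_entropy(ps)
    print(H[0], H[50], H[100])   # 0.0 at p=0, 1.0 bit at p=0.5, 0.0 at p=1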

Definition 3.1 can be easily extended to the entropy of a joint distribution.

Definition 3.2 The joint entropy H(X, Y) of two discrete random variables X, Y is defined by

    H(X, Y) = - Σ_{x,y} p(x, y) log p(x, y)    (3.2)

The entropy of a conditional distribution comes similarly naturally.

Definition 3.3 The conditional entropy H(X | Y) of two discrete random variables X, Y is defined by

    H(X | Y) = - Σ_{x,y} p(x, y) log ( p(x, y) / p(y) )    (3.3)

The measures introduced up until now allow one to describe random variables in terms of the information they contain. But how can different distributions over a random variable be compared?

Definition 3.4 The Kullback-Leibler divergence D_KL between two probability mass functions p(x) and q(x) is defined as

    D_KL(p(x) ‖ q(x)) = Σ_x p(x) log ( p(x) / q(x) )    (3.4)

The Kullback-Leibler divergence is similar to a distance between two distributions - it describes the amount of information that is gained by knowing the true distribution p of a random variable instead of assuming its distribution to be q. It is no true distance in the mathematical sense, since it does not satisfy the triangle inequality. Furthermore, it can be used to measure the information a random variable contains about another random variable, their mutual information.

Definition 3.5 The mutual information I(X; Y) between two random variables X and Y is defined as the Kullback-Leibler divergence between the corresponding joint distribution p(x, y) and the product of the respective distributions p(x) and p(y), and is thus

    I(X; Y) = D_KL(p(x, y) ‖ p(x)p(y)) = Σ_{x,y} p(x, y) log ( p(x, y) / (p(x)p(y)) )    (3.5)

In other words, mutual information measures how much knowing the value of one random variable reduces the uncertainty about another random variable. Mutual information is symmetric and can be reformulated as

    I(X; Y) = H(X) - H(X | Y) = H(Y) - H(Y | X)    (3.6)

Lastly, there are two theorems about the quantities introduced above that will be important later in this work. These properties can be derived through the use of Jensen's inequality.

Theorem 3.1 (Jensen's inequality) If f is a convex function and X a random variable, then

    E[f(X)] ≥ f(E[X]).    (3.7)

From Jensen's inequality and the fact that x log x is convex for x ≥ 0 follows Gibbs' inequality.

Theorem 3.2 (Gibbs' inequality) Let p(x) and q(x) be two probability mass functions. Then

    D_KL(p ‖ q) ≥ 0    (3.8)

with equality if and only if p(x) = q(x) for all x.

Since mutual information can be formulated as a Kullback-Leibler divergence, it is never negative, and it is 0 when X and Y are independent, i.e. p(x)p(y) = p(x, y). The next theorem establishes the upper bound of the entropy and the maximum entropy distribution.

Theorem 3.3 H(X) ≤ log |X| with equality if and only if X is uniformly distributed.

|X| denotes the cardinality of the sample space of X. For any distribution p(x) and the uniform distribution u(x) = 1/|X|, the Kullback-Leibler divergence is

    D_KL(p ‖ u) = Σ_x p(x) log ( p(x) / u(x) ) = log |X| - H(X).    (3.9)

Taking into account Gibbs' inequality, the uniform distribution has maximum entropy and is the only such distribution. To sum up the boundaries of entropy and Kullback-Leibler divergence:

    0 ≤ H(X) ≤ log |X|,   0 ≤ D_KL(p ‖ q),

with D_KL(p ‖ u) ≤ log |X| in particular for the uniform distribution u.
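These quantities are straightforward to compute for discrete distributions; the following is a small sketch of my own (not from the thesis) for entropy, Kullback-Leibler divergence and mutual information, all in bits.

    import numpy as np

    def entropy(p):
        """H(p) in bits, with the convention 0 log 0 = 0."""
        p = np.asarray(p, dtype=float)
        nz = p > 0
        return -np.sum(p[nz] * np.log2(p[nz]))

    def kl_divergence(p, q):
        """D_KL(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        nz = p > 0
        return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

    def mutual_information(p_xy):
        """I(X;Y) = D_KL(p(x,y) || p(x)p(y)) for a joint distribution given as a matrix."""
        p_xy = np.asarray(p_xy, dtype=float)
        p_x = p_xy.sum(axis=1, keepdims=True)
        p_y = p_xy.sum(axis=0, keepdims=True)
        return kl_divergence(p_xy.ravel(), (p_x * p_y).ravel())

    # Sanity check against Theorem 3.3: the uniform distribution maximizes entropy.
    u = np.full(8, 1 / 8)
    assert np.isclose(entropy(u), np.log2(8))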

The remainder of this section will analyse the properties of Markov processes from the standpoint of information theory.

3.2 Information Theoretic Properties of Stochastic Processes

The Markov decision process as a model for reinforcement learning was introduced in Section 2.2. It was also mentioned that for a fixed policy π, an MDP reduces to a Markov chain. A key property of such Markov chains is the stationary state distribution. This distribution and its requirements will make up the first part of this section, followed by an analysis based on the information theoretic quantities introduced above.

Markov chains are stochastic processes (sequences of random variables) defined by their state transition probabilities p(x_{t+1} | x_t). As can be easily guessed from their name, Markov chains possess the Markov property (see Equation 2.3). The state transition probabilities are time invariant. The probability distribution p_t(x) is the distribution over the states of a Markov chain at time t. This distribution evolves over time according to

    p_{t+1}(x') = Σ_x p_t(x) p(x' | x).    (3.10)

The state distribution p_0(x) is referred to as the starting distribution.

Definition 3.6 (Stationary distribution) The state distribution of a Markov chain is called a stationary distribution µ if

    µ = µP    (3.11)

where P is the transition matrix. In other words, the stationary distribution does not change over time. There are two conditions a Markov chain needs to satisfy for it to have a unique stationary distribution.

Definition 3.7 A Markov chain is said to be irreducible if from any state of the chain every other state is reachable in finite time with positive probability, that is,

    ∃ n > 0 : p(X_n = j | X_0 = i) > 0   for all i, j    (3.12)

Definition 3.8 The period k of a state x of a Markov chain is defined as

    k = gcd { n > 0 : p(X_n = x | X_0 = x) > 0 }.    (3.13)

If its period is k = 1, a state is called aperiodic. Else, the state is called periodic. If an irreducible Markov chain has one aperiodic state, all of its states are aperiodic, and the Markov chain is called aperiodic.

Every irreducible and aperiodic Markov chain has a unique stationary distribution and converges to this stationary distribution for arbitrary starting distributions.

If the starting distribution of a Markov chain is stationary, the Markov chain is stationary. But since the state distribution of every irreducible and aperiodic Markov chain converges to a stationary distribution, every such Markov chain becomes stationary for t → ∞. Now that the stationary distribution has been defined, another quantity of Markov chains can be introduced.

Definition 3.9 The entropy rate H(X) of a stochastic process is defined as

    H(X) = lim_{t→∞} (1/t) H(X_1, X_2, ..., X_t)    (3.14)

when the limit exists. The entropy rate describes how the entropy of the sequence grows over time. This is closely related to

    H'(X) = lim_{t→∞} H(X_t | X_{t-1}, X_{t-2}, ..., X_1).    (3.15)

For stationary Markov chains, the entropy rates H(X) and H'(X) are equal and can be easily calculated.

Theorem 3.4 For a stationary Markov chain, the entropy rate is

    H(X) = - Σ_{x_t, x_{t+1}} µ(x_t) p(x_{t+1} | x_t) log p(x_{t+1} | x_t) = H(X_2 | X_1).    (3.16)

The last equality follows from the Markov property of the Markov chain. In other words, the entropy rate is the information contained in the transition. If the transition probabilities and the last state of the Markov chain are known, the entropy rate describes the amount of uncertainty regarding the next state. From here on out, the entropy rate will be denoted by H(X_{t+1} | X_t).
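The stationary distribution and the entropy rate of Theorem 3.4 can be computed directly from a transition matrix; this is a minimal sketch of my own (not from the thesis), using power iteration for µ.

    import numpy as np

    def stationary_distribution(P, iters=10000):
        """Power iteration mu <- mu P for a row-stochastic transition matrix P."""
        mu = np.full(P.shape[0], 1.0 / P.shape[0])   # start from the uniform distribution
        for _ in range(iters):
            mu = mu @ P
        return mu

    def entropy_rate(P):
        """H(X_{t+1} | X_t) = - sum_{x,x'} mu(x) P[x,x'] log2 P[x,x'] (Theorem 3.4)."""
        mu = stationary_distribution(P)
        with np.errstate(divide='ignore', invalid='ignore'):
            logP = np.where(P > 0, np.log2(P), 0.0)
        return -np.sum(mu[:, None] * P * logP)

    # Example: a two-state chain that mostly stays put has a low entropy rate.
    P = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
    print(stationary_distribution(P))   # approximately [0.5, 0.5]
    print(entropy_rate(P))              # approximately 0.469 bits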

4 Contrasting Different Notions of Explorative Behaviour

This section will elaborate more on the role of exploration in classic reinforcement learning, introduce several approaches that use information theory to drive exploration, and compare the different notions of exploration.

4.1 Exploration in Reinforcement Learning

Section 2 introduced the classic reinforcement learning problem and several approaches to solve it. Underlying all these approaches is the concept of some optimal behavior to be achieved. Optimality is defined with regards to a value function that embodies an expected future return. Since the agent acts in an unknown environment, it has to perform exploration to find out what constitutes optimal behavior in this environment. During exploration, it usually performs non-optimally. Therefore, the time used for exploration has to be minimized and used efficiently to gain enough knowledge of the environment dynamics. Whether the agent directly approximates the value function via temporal difference learning, learns a model it can use to plan via RMAX or E³, or tries to optimally balance exploration and exploitation with regards to some prior about the model via Bayesian reinforcement learning does not matter: the outcome is always supposed to be optimal behavior. Frameworks such as PAC-MDP or the bounds presented by Kolter and Ng give guarantees that the time spent performing non-optimally is limited with some probability. Thus, exploration can be seen as something that has to be done to enable the agent to do what it is supposed to do, i.e. accumulate as much reward as possible. This is referred to as the exploration-exploitation tradeoff. Exploration can be induced either through selecting non-optimal actions with a certain probability, e.g. ε-greedy or Boltzmann action selection, through the optimism in the face of uncertainty concept as embodied by RMAX or E³, or by incorporating the information gain into the expected return through the Bayesian formulation of the value.

All of these approaches assume a stationary, task-dependent reward function. When approximating a value function, as in temporal difference learning, the approximation converges over time, and after a certain amount of sampling the policy is, neglecting the noisy action selection, also fixed and optimal. With optimism in the face of uncertainty, exploration can and does end


More information

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam:

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam: Marks } Assignment 1: Should be out this weekend } All are marked, I m trying to tally them and perhaps add bonus points } Mid-term: Before the last lecture } Mid-term deferred exam: } This Saturday, 9am-10.30am,

More information

Reinforcement Learning: the basics

Reinforcement Learning: the basics Reinforcement Learning: the basics Olivier Sigaud Université Pierre et Marie Curie, PARIS 6 http://people.isir.upmc.fr/sigaud August 6, 2012 1 / 46 Introduction Action selection/planning Learning by trial-and-error

More information

Reinforcement Learning

Reinforcement Learning 1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision

More information

Lecture 25: Learning 4. Victor R. Lesser. CMPSCI 683 Fall 2010

Lecture 25: Learning 4. Victor R. Lesser. CMPSCI 683 Fall 2010 Lecture 25: Learning 4 Victor R. Lesser CMPSCI 683 Fall 2010 Final Exam Information Final EXAM on Th 12/16 at 4:00pm in Lederle Grad Res Ctr Rm A301 2 Hours but obviously you can leave early! Open Book

More information

Lecture 3: Markov Decision Processes

Lecture 3: Markov Decision Processes Lecture 3: Markov Decision Processes Joseph Modayil 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs Markov Processes Introduction Introduction to MDPs Markov

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Ron Parr CompSci 7 Department of Computer Science Duke University With thanks to Kris Hauser for some content RL Highlights Everybody likes to learn from experience Use ML techniques

More information

Reinforcement Learning. Introduction

Reinforcement Learning. Introduction Reinforcement Learning Introduction Reinforcement Learning Agent interacts and learns from a stochastic environment Science of sequential decision making Many faces of reinforcement learning Optimal control

More information

Reinforcement Learning with Function Approximation. Joseph Christian G. Noel

Reinforcement Learning with Function Approximation. Joseph Christian G. Noel Reinforcement Learning with Function Approximation Joseph Christian G. Noel November 2011 Abstract Reinforcement learning (RL) is a key problem in the field of Artificial Intelligence. The main goal is

More information

Temporal Difference. Learning KENNETH TRAN. Principal Research Engineer, MSR AI

Temporal Difference. Learning KENNETH TRAN. Principal Research Engineer, MSR AI Temporal Difference Learning KENNETH TRAN Principal Research Engineer, MSR AI Temporal Difference Learning Policy Evaluation Intro to model-free learning Monte Carlo Learning Temporal Difference Learning

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Temporal Difference Learning Temporal difference learning, TD prediction, Q-learning, elibigility traces. (many slides from Marc Toussaint) Vien Ngo MLR, University of Stuttgart

More information

Seminar in Artificial Intelligence Near-Bayesian Exploration in Polynomial Time

Seminar in Artificial Intelligence Near-Bayesian Exploration in Polynomial Time Seminar in Artificial Intelligence Near-Bayesian Exploration in Polynomial Time 26.11.2015 Fachbereich Informatik Knowledge Engineering Group David Fischer 1 Table of Contents Problem and Motivation Algorithm

More information

arxiv: v1 [cs.ai] 5 Nov 2017

arxiv: v1 [cs.ai] 5 Nov 2017 arxiv:1711.01569v1 [cs.ai] 5 Nov 2017 Markus Dumke Department of Statistics Ludwig-Maximilians-Universität München markus.dumke@campus.lmu.de Abstract Temporal-difference (TD) learning is an important

More information

MDP Preliminaries. Nan Jiang. February 10, 2019

MDP Preliminaries. Nan Jiang. February 10, 2019 MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process

More information

Notes on Reinforcement Learning

Notes on Reinforcement Learning 1 Introduction Notes on Reinforcement Learning Paulo Eduardo Rauber 2014 Reinforcement learning is the study of agents that act in an environment with the goal of maximizing cumulative reward signals.

More information

Lecture 1: March 7, 2018

Lecture 1: March 7, 2018 Reinforcement Learning Spring Semester, 2017/8 Lecture 1: March 7, 2018 Lecturer: Yishay Mansour Scribe: ym DISCLAIMER: Based on Learning and Planning in Dynamical Systems by Shie Mannor c, all rights

More information

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING AN INTRODUCTION Prof. Dr. Ann Nowé Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING WHAT IS IT? What is it? Learning from interaction Learning about, from, and while

More information

Reinforcement Learning. George Konidaris

Reinforcement Learning. George Konidaris Reinforcement Learning George Konidaris gdk@cs.brown.edu Fall 2017 Machine Learning Subfield of AI concerned with learning from data. Broadly, using: Experience To Improve Performance On Some Task (Tom

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Stuart Russell, UC Berkeley Stuart Russell, UC Berkeley 1 Outline Sequential decision making Dynamic programming algorithms Reinforcement learning algorithms temporal difference

More information

Reinforcement learning an introduction

Reinforcement learning an introduction Reinforcement learning an introduction Prof. Dr. Ann Nowé Computational Modeling Group AIlab ai.vub.ac.be November 2013 Reinforcement Learning What is it? Learning from interaction Learning about, from,

More information

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning CSE 473: Artificial Intelligence Reinforcement Learning Dan Weld Today s Outline Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Many slides

More information

Notes on Tabular Methods

Notes on Tabular Methods Notes on Tabular ethods Nan Jiang September 28, 208 Overview of the methods. Tabular certainty-equivalence Certainty-equivalence is a model-based RL algorithm, that is, it first estimates an DP model from

More information

CS 7180: Behavioral Modeling and Decisionmaking

CS 7180: Behavioral Modeling and Decisionmaking CS 7180: Behavioral Modeling and Decisionmaking in AI Markov Decision Processes for Complex Decisionmaking Prof. Amy Sliva October 17, 2012 Decisions are nondeterministic In many situations, behavior and

More information

1 Problem Formulation

1 Problem Formulation Book Review Self-Learning Control of Finite Markov Chains by A. S. Poznyak, K. Najim, and E. Gómez-Ramírez Review by Benjamin Van Roy This book presents a collection of work on algorithms for learning

More information

Administration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon.

Administration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon. Administration CSCI567 Machine Learning Fall 2018 Prof. Haipeng Luo U of Southern California Nov 7, 2018 HW5 is available, due on 11/18. Practice final will also be available soon. Remaining weeks: 11/14,

More information

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it

More information

CS 570: Machine Learning Seminar. Fall 2016

CS 570: Machine Learning Seminar. Fall 2016 CS 570: Machine Learning Seminar Fall 2016 Class Information Class web page: http://web.cecs.pdx.edu/~mm/mlseminar2016-2017/fall2016/ Class mailing list: cs570@cs.pdx.edu My office hours: T,Th, 2-3pm or

More information

RL 3: Reinforcement Learning

RL 3: Reinforcement Learning RL 3: Reinforcement Learning Q-Learning Michael Herrmann University of Edinburgh, School of Informatics 20/01/2015 Last time: Multi-Armed Bandits (10 Points to remember) MAB applications do exist (e.g.

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Dynamic Programming Marc Toussaint University of Stuttgart Winter 2018/19 Motivation: So far we focussed on tree search-like solvers for decision problems. There is a second important

More information

Reinforcement Learning and Deep Reinforcement Learning

Reinforcement Learning and Deep Reinforcement Learning Reinforcement Learning and Deep Reinforcement Learning Ashis Kumer Biswas, Ph.D. ashis.biswas@ucdenver.edu Deep Learning November 5, 2018 1 / 64 Outlines 1 Principles of Reinforcement Learning 2 The Q

More information

An Analysis of Model-Based Interval Estimation for Markov Decision Processes

An Analysis of Model-Based Interval Estimation for Markov Decision Processes An Analysis of Model-Based Interval Estimation for Markov Decision Processes Alexander L. Strehl, Michael L. Littman astrehl@gmail.com, mlittman@cs.rutgers.edu Computer Science Dept. Rutgers University

More information

A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time

A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time April 16, 2016 Abstract In this exposition we study the E 3 algorithm proposed by Kearns and Singh for reinforcement

More information

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018 Machine learning for image classification Lecture 14: Reinforcement learning May 9, 2018 Page 3 Outline Motivation Introduction to reinforcement learning (RL) Value function based methods (Q-learning)

More information

PAC Model-Free Reinforcement Learning

PAC Model-Free Reinforcement Learning Alexander L. Strehl strehl@cs.rutgers.edu Lihong Li lihong@cs.rutgers.edu Department of Computer Science, Rutgers University, Piscataway, NJ 08854 USA Eric Wiewiora ewiewior@cs.ucsd.edu Computer Science

More information

Markov Decision Processes and Solving Finite Problems. February 8, 2017

Markov Decision Processes and Solving Finite Problems. February 8, 2017 Markov Decision Processes and Solving Finite Problems February 8, 2017 Overview of Upcoming Lectures Feb 8: Markov decision processes, value iteration, policy iteration Feb 13: Policy gradients Feb 15:

More information

Prioritized Sweeping Converges to the Optimal Value Function

Prioritized Sweeping Converges to the Optimal Value Function Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science

More information

Lecture 2: Learning from Evaluative Feedback. or Bandit Problems

Lecture 2: Learning from Evaluative Feedback. or Bandit Problems Lecture 2: Learning from Evaluative Feedback or Bandit Problems 1 Edward L. Thorndike (1874-1949) Puzzle Box 2 Learning by Trial-and-Error Law of Effect: Of several responses to the same situation, those

More information

A Gentle Introduction to Reinforcement Learning

A Gentle Introduction to Reinforcement Learning A Gentle Introduction to Reinforcement Learning Alexander Jung 2018 1 Introduction and Motivation Consider the cleaning robot Rumba which has to clean the office room B329. In order to keep things simple,

More information

Open Theoretical Questions in Reinforcement Learning

Open Theoretical Questions in Reinforcement Learning Open Theoretical Questions in Reinforcement Learning Richard S. Sutton AT&T Labs, Florham Park, NJ 07932, USA, sutton@research.att.com, www.cs.umass.edu/~rich Reinforcement learning (RL) concerns the problem

More information

Q-Learning for Markov Decision Processes*

Q-Learning for Markov Decision Processes* McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan khoa.phan@mail.mcgill.ca Sandeep Manjanna sandeep.manjanna@mail.mcgill.ca (*Based on: Convergence of

More information

Reinforcement Learning

Reinforcement Learning CS7/CS7 Fall 005 Supervised Learning: Training examples: (x,y) Direct feedback y for each input x Sequence of decisions with eventual feedback No teacher that critiques individual actions Learn to act

More information

Reinforcement Learning. Yishay Mansour Tel-Aviv University

Reinforcement Learning. Yishay Mansour Tel-Aviv University Reinforcement Learning Yishay Mansour Tel-Aviv University 1 Reinforcement Learning: Course Information Classes: Wednesday Lecture 10-13 Yishay Mansour Recitations:14-15/15-16 Eliya Nachmani Adam Polyak

More information

An Adaptive Clustering Method for Model-free Reinforcement Learning

An Adaptive Clustering Method for Model-free Reinforcement Learning An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at

More information

Q-learning. Tambet Matiisen

Q-learning. Tambet Matiisen Q-learning Tambet Matiisen (based on chapter 11.3 of online book Artificial Intelligence, foundations of computational agents by David Poole and Alan Mackworth) Stochastic gradient descent Experience

More information

Lecture 1: Introduction, Entropy and ML estimation

Lecture 1: Introduction, Entropy and ML estimation 0-704: Information Processing and Learning Spring 202 Lecture : Introduction, Entropy and ML estimation Lecturer: Aarti Singh Scribes: Min Xu Disclaimer: These notes have not been subjected to the usual

More information

Final Exam December 12, 2017

Final Exam December 12, 2017 Introduction to Artificial Intelligence CSE 473, Autumn 2017 Dieter Fox Final Exam December 12, 2017 Directions This exam has 7 problems with 111 points shown in the table below, and you have 110 minutes

More information

Outline. CSE 573: Artificial Intelligence Autumn Agent. Partial Observability. Markov Decision Process (MDP) 10/31/2012

Outline. CSE 573: Artificial Intelligence Autumn Agent. Partial Observability. Markov Decision Process (MDP) 10/31/2012 CSE 573: Artificial Intelligence Autumn 2012 Reasoning about Uncertainty & Hidden Markov Models Daniel Weld Many slides adapted from Dan Klein, Stuart Russell, Andrew Moore & Luke Zettlemoyer 1 Outline

More information

REINFORCEMENT LEARNING

REINFORCEMENT LEARNING REINFORCEMENT LEARNING Larry Page: Where s Google going next? DeepMind's DQN playing Breakout Contents Introduction to Reinforcement Learning Deep Q-Learning INTRODUCTION TO REINFORCEMENT LEARNING Contents

More information

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation CS234: RL Emma Brunskill Winter 2018 Material builds on structure from David SIlver s Lecture 4: Model-Free

More information

Bias-Variance Error Bounds for Temporal Difference Updates

Bias-Variance Error Bounds for Temporal Difference Updates Bias-Variance Bounds for Temporal Difference Updates Michael Kearns AT&T Labs mkearns@research.att.com Satinder Singh AT&T Labs baveja@research.att.com Abstract We give the first rigorous upper bounds

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Based on [Kaelbling et al., 1996, Bertsekas, 2000] Bert Kappen Reinforcement learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error

More information

Review: TD-Learning. TD (SARSA) Learning for Q-values. Bellman Equations for Q-values. P (s, a, s )[R(s, a, s )+ Q (s, (s ))]

Review: TD-Learning. TD (SARSA) Learning for Q-values. Bellman Equations for Q-values. P (s, a, s )[R(s, a, s )+ Q (s, (s ))] Review: TD-Learning function TD-Learning(mdp) returns a policy Class #: Reinforcement Learning, II 8s S, U(s) =0 set start-state s s 0 choose action a, using -greedy policy based on U(s) U(s) U(s)+ [r

More information

Final Exam December 12, 2017

Final Exam December 12, 2017 Introduction to Artificial Intelligence CSE 473, Autumn 2017 Dieter Fox Final Exam December 12, 2017 Directions This exam has 7 problems with 111 points shown in the table below, and you have 110 minutes

More information

Lecture 4: Approximate dynamic programming

Lecture 4: Approximate dynamic programming IEOR 800: Reinforcement learning By Shipra Agrawal Lecture 4: Approximate dynamic programming Deep Q Networks discussed in the last lecture are an instance of approximate dynamic programming. These are

More information

Notes from Week 9: Multi-Armed Bandit Problems II. 1 Information-theoretic lower bounds for multiarmed

Notes from Week 9: Multi-Armed Bandit Problems II. 1 Information-theoretic lower bounds for multiarmed CS 683 Learning, Games, and Electronic Markets Spring 007 Notes from Week 9: Multi-Armed Bandit Problems II Instructor: Robert Kleinberg 6-30 Mar 007 1 Information-theoretic lower bounds for multiarmed

More information

Efficient Learning in Linearly Solvable MDP Models

Efficient Learning in Linearly Solvable MDP Models Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence Efficient Learning in Linearly Solvable MDP Models Ang Li Department of Computer Science, University of Minnesota

More information

Sequential decision making under uncertainty. Department of Computer Science, Czech Technical University in Prague

Sequential decision making under uncertainty. Department of Computer Science, Czech Technical University in Prague Sequential decision making under uncertainty Jiří Kléma Department of Computer Science, Czech Technical University in Prague https://cw.fel.cvut.cz/wiki/courses/b4b36zui/prednasky pagenda Previous lecture:

More information

The Reinforcement Learning Problem

The Reinforcement Learning Problem The Reinforcement Learning Problem Slides based on the book Reinforcement Learning by Sutton and Barto Formalizing Reinforcement Learning Formally, the agent and environment interact at each of a sequence

More information

An Introduction to Reinforcement Learning

An Introduction to Reinforcement Learning An Introduction to Reinforcement Learning Shivaram Kalyanakrishnan shivaram@cse.iitb.ac.in Department of Computer Science and Engineering Indian Institute of Technology Bombay April 2018 What is Reinforcement

More information

15-780: ReinforcementLearning

15-780: ReinforcementLearning 15-780: ReinforcementLearning J. Zico Kolter March 2, 2016 1 Outline Challenge of RL Model-based methods Model-free methods Exploration and exploitation 2 Outline Challenge of RL Model-based methods Model-free

More information

Lecture 8: Policy Gradient

Lecture 8: Policy Gradient Lecture 8: Policy Gradient Hado van Hasselt Outline 1 Introduction 2 Finite Difference Policy Gradient 3 Monte-Carlo Policy Gradient 4 Actor-Critic Policy Gradient Introduction Vapnik s rule Never solve

More information

Bellmanian Bandit Network

Bellmanian Bandit Network Bellmanian Bandit Network Antoine Bureau TAO, LRI - INRIA Univ. Paris-Sud bldg 50, Rue Noetzlin, 91190 Gif-sur-Yvette, France antoine.bureau@lri.fr Michèle Sebag TAO, LRI - CNRS Univ. Paris-Sud bldg 50,

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Function approximation Mario Martin CS-UPC May 18, 2018 Mario Martin (CS-UPC) Reinforcement Learning May 18, 2018 / 65 Recap Algorithms: MonteCarlo methods for Policy Evaluation

More information

3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H.

3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H. Appendix A Information Theory A.1 Entropy Shannon (Shanon, 1948) developed the concept of entropy to measure the uncertainty of a discrete random variable. Suppose X is a discrete random variable that

More information

CSEP 573: Artificial Intelligence

CSEP 573: Artificial Intelligence CSEP 573: Artificial Intelligence Hidden Markov Models Luke Zettlemoyer Many slides over the course adapted from either Dan Klein, Stuart Russell, Andrew Moore, Ali Farhadi, or Dan Weld 1 Outline Probabilistic

More information

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes Today s s Lecture Lecture 20: Learning -4 Review of Neural Networks Markov-Decision Processes Victor Lesser CMPSCI 683 Fall 2004 Reinforcement learning 2 Back-propagation Applicability of Neural Networks

More information

CSE250A Fall 12: Discussion Week 9

CSE250A Fall 12: Discussion Week 9 CSE250A Fall 12: Discussion Week 9 Aditya Menon (akmenon@ucsd.edu) December 4, 2012 1 Schedule for today Recap of Markov Decision Processes. Examples: slot machines and maze traversal. Planning and learning.

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Lecture 3: RL problems, sample complexity and regret Alexandre Proutiere, Sadegh Talebi, Jungseul Ok KTH, The Royal Institute of Technology Objectives of this lecture Introduce the

More information

ARTIFICIAL INTELLIGENCE. Reinforcement learning

ARTIFICIAL INTELLIGENCE. Reinforcement learning INFOB2KI 2018-2019 Utrecht University The Netherlands ARTIFICIAL INTELLIGENCE Reinforcement learning Lecturer: Silja Renooij These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

More information