Comparison of Information Theory Based and Standard Methods for Exploration in Reinforcement Learning


Freie Universität Berlin
Fachbereich Mathematik und Informatik

Master Thesis

Comparison of Information Theory Based and Standard Methods for Exploration in Reinforcement Learning

Michael Borst

Advisor: Prof. Dr. Marc Toussaint

Berlin,


Abstract

Exploration is a key part of reinforcement learning. In the classic setting, autonomous agents are supposed to learn a model of their environment to successfully complete a task. Recent works in the field and in related fields have suggested the use of quantities based on Shannon's information theory to enable agents to do so. The underlying concepts of exploration vary between those works. In this thesis, these different notions of exploration will be introduced and compared. Further, two algorithms based on established dynamic programming methods are introduced to maximize two information theoretic quantities: the entropy of the state distribution and predictive information, a quantity relating the past and the future of the agent. These algorithms are evaluated in two settings: planning with the true world model, and interaction with the environment without prior knowledge. Entropy maximization proved to be possible in both settings, while predictive information maximization was only successful in the first. The behavior resulting from maximizing these quantities is also analyzed.


Contents

1 Introduction
  1.1 Outline
2 Reinforcement Learning
  2.1 The Reinforcement Learning Problem
  2.2 Markov Decision Processes
    2.2.1 Environments
  2.3 Planning under Uncertainty
  2.4 An Overview of Reinforcement Learning Methods
    2.4.1 Sample complexity and PAC-MDP
    2.4.2 Temporal Difference Learning
    2.4.3 Model-Based Learning
    2.4.4 Bayesian Reinforcement Learning
3 Shannon's Information Theory
  3.1 Fundamental Quantities of Shannon's Information Theory
  3.2 Information Theoretic Properties of Stochastic Processes
4 Contrasting Different Notions of Explorative Behaviour
  4.1 Exploration in Reinforcement Learning
  4.2 Information Theoretic Measures in Reinforcement Learning
    4.2.1 Information Gain
    4.2.2 Entropy
    4.2.3 Predictive Information
  4.3 Similarities and Differences
5 Maximizing Information Theoretic Quantities
  5.1 Information Theoretic Quantities in MDPs
    5.1.1 The State Distribution
    5.1.2 State Entropy
    5.1.3 Predictive Information
  5.2 Modification of the Standard Methods
    5.2.1 Reward Functions
    5.2.2 Q-Iteration
    5.2.3 Policy Iteration
    5.2.4 Action Selection
6 Evaluation of Planning Algorithms
  6.1 Entropy Maximization
  6.2 Predictive Information Maximization
  6.3 Discussion
7 Reinforcement Learning Evaluation
  7.1 Model Accuracy
  7.2 Information Theoretic Quantities
  7.3 Discussion
8 Conclusion
  8.1 Future Research
Bibliography
Declaration of Academic Integrity

1 Introduction

The idea of intelligent machines has a long history. Alan Turing, one of the founding fathers of modern computer science, already entertained this idea in his seminal article Computing Machinery and Intelligence in 1950 [24]. In this article, he not only introduced the Turing test to assess the intelligence of a machine, he also suggested machines could learn by trial and error. This approach, termed reinforcement learning, has by now been applied to a vast array of problems, from scheduling [29] to autonomous helicopter flight [12].

In the classic reinforcement learning setting, an agent has to learn how to accomplish a task in an unknown environment by interacting with this environment, with only a reward signal as guidance. This reward signal, and thus the task, does not originate from within the agent; it is an external motivation. Everything the agent learns is dedicated to solving that task. But what if the agent does not have such a specific purpose? What if it is supposed to just familiarize itself with its environment, to learn what can be done within this environment, or to just generally behave in a certain fashion? The motivation has to be relocated; it has to become an intrinsic drive to learn or to act. Recent work in this respect has applied Shannon's information theory to create intrinsic motivation ([17], [1], [28], [16], [23], [11]). While the intentions behind these applications vary, all share the use of information theoretic quantities to derive an intrinsic motivation for the agent to behave in a certain way.

This thesis will give an overview of the information theoretic approaches to reinforcement learning, focused on those which try to induce learning or explorative behavior in the agent, elaborate on their intentions, and compare them to classic reinforcement learning. Further, it will present modifications of existing algorithms to maximize information theoretic quantities, namely entropy and predictive information, the latter inspired by [1]. It will evaluate the ability of these algorithms to do so both with a model of the environment available and without such a model.

1.1 Outline

The remainder of this work is structured as follows: Section 2 will introduce the basic reinforcement learning problem, a model to formalize it, and ways to solve it. Section 3 will familiarize the reader with Shannon's information theory and its application to stochastic processes.

Section 4 will elaborate more on the role of exploration in classic reinforcement learning, followed by a presentation of different works using information theory in the context of reinforcement learning and a comparison of both. Section 5 will then present algorithms that maximize entropy and predictive information. These algorithms will be evaluated as planning algorithms given a model of the environment in Section 6 and in an unknown environment in Section 7. The thesis will be finalized by a summary and an outlook on possible future work in Section 8.

2 Reinforcement Learning

This section will introduce the reinforcement learning problem, the Markov decision process as a formalization of this problem, basic methods of determining optimal behavior given the required knowledge, and methods that both acquire this knowledge and enable optimal behavior. The first three sections are based on the introduction to reinforcement learning by Sutton and Barto [22], and the reader is referred to this book for a more detailed introduction.

2.1 The Reinforcement Learning Problem

Reinforcement learning is a subfield of machine learning that is concerned with learning how to achieve goals through interaction. A decision maker, called agent, is set in an unfamiliar environment. The agent is able to observe the state s_t of the environment at a time t through some kind of sensor. In reaction to this state, it can interact with the environment by performing an action a_t. For this action, it receives a reward r_{t+1} and then transitions into a new state s_{t+1}. One interaction is therefore defined by the 4-tuple (s_t, a_t, s_{t+1}, r_{t+1}). (Where unambiguous, s is used for s_t, s' for s_{t+1}, a for a_t and r for r_{t+1}.) This interface for the agent-environment interaction is illustrated in Figure 2.1.

Figure 2.1: The agent-environment interface [22]: the agent observes state s_t and reacts with action a_t. This results in a reward signal r_{t+1} and a new environment state s_{t+1}.

The transitions from state to state via actions are stochastic, i.e., the successor state of an action in a state is not always the same. The agent's goal is to maximize the cumulated reward over time, or return R_T:

    R_T = r_{t+1} + r_{t+2} + r_{t+3} + ... + r_T = Σ_{i=0}^{T-t-1} r_{t+i+1}    (2.1)

where T is the final time step. If T is finite, one speaks of a finite horizon setting. For T = ∞, this definition is problematic, since the return is no longer guaranteed to be finite. Therefore, for tasks that do not have a clearly defined stopping point, discounting is used. So, in the infinite horizon setting, the discounted return is

    R = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... = Σ_{i=0}^{∞} γ^i r_{t+i+1}    (2.2)

where 0 ≤ γ ≤ 1 is a discount rate. The discount rate determines how much immediate reward is preferred over future reward; the lower the rate, the less future reward is considered.

To maximize its return, the agent needs to have knowledge about the properties of its environment. Therefore, it needs to gather information about states and actions it hasn't seen (often enough) yet by sampling them. But doing so means disregarding actions that are considered more rewarding given the current knowledge of the agent. This dilemma is known as the exploration-exploitation tradeoff and is one of the central problems in reinforcement learning. Every algorithm must present some kind of solution to this problem.

The reinforcement learning problem differs from other problems in machine learning. In contrast to supervised learning, where there is a set of pairs of input and desired output, the agent always has to produce the learning data through its own behavior. But unlike unsupervised learning, there is feedback which the agent can use to adapt its behavior to the environment. In the next section, Markov decision processes will be introduced as a formal model for reinforcement learning.

2.2 Markov Decision Processes

The most common formal model for reinforcement learning is the Markov decision process (MDP). An MDP is a discrete-time stochastic control process and is defined by the 4-tuple (S, A, p, R) with the set of states S, the set of actions A, the state transition distribution p(s' | s, a), s, s' ∈ S, a ∈ A, and the reward function or signal R : S × A × S → ℝ.

The reward function is bounded. If both the set of states and the set of actions are finite, the MDP is called a finite MDP. From here on in, every MDP is considered a finite MDP. Furthermore, the state transition probability distribution is considered to be stationary, that is, conditionally independent of time.

An important property of MDPs is the Markov property.

Definition 2.1 A stochastic process possesses the Markov property if

    p(X_{n+1} = x_{n+1} | X_n = x_n, X_{n-1} = x_{n-1}, ..., X_0 = x_0) = p(X_{n+1} = x_{n+1} | X_n = x_n)    (2.3)

In other words, the future of the MDP depends solely on the present and not on the past. Therefore, the agent only has to take the current state into account when selecting an action. What action a the agent selects in reaction to a state s depends on its policy π. Policies can be either deterministic or stochastic. In this work, deterministic policies will be denoted as π(s) and stochastic policies as π(a | s). Policies are stationary.

2.2.1 Environments

There are three environments that will be used in this work. They will now be presented for later use.

The first environment will be referred to as dense world. This environment has the same number of actions and states. The actions have the following transition probabilities:

    p(s_k | s_i, a_j) = p                 if j = k
                        (1 - p)/(|S| - 1) else

for all i, j, k ∈ [1, |S|], where p is usually significantly bigger than 0.5, e.g. p = 0.8, so that every action is biased towards one successor state.

The second environment consists of an arbitrary number of states and two actions and is referred to as circle world. The circle world has the following transition probabilities:

    p(s_j | s_i, a_1) = p       if j = i + 1, or i = |S| and j = 1
                        1 - p   if j = i
                        0       else

    p(s_j | s_i, a_2) = p       if j = i - 1, or i = 1 and j = |S|
                        1 - p   if j = i
                        0       else

for all i, j ∈ [1, |S|], where p is again significantly bigger than 0.5. This environment is called circle world because if the agent always chooses action a_1 or always chooses a_2, it will have a trajectory resembling a circle.

The third environment is referred to as grid world. It is a two-dimensional environment that consists of an arbitrary number of states and four actions. The states have coordinates (x, y). The four actions can be interpreted as east (a_e), west (a_w), north (a_n) and south (a_s). As one would expect, performing one of these actions is biased to lead to a one-step increase in x, decrease in x, increase in y or decrease in y, respectively, with probability p. The probability 1 - p is equally distributed among the states in the other directions, if they exist: for a state that has neighboring states in the three other directions, the probability of ending up in each of these states instead of the state towards which the action is biased is (1 - p)/3, for two other directions it is (1 - p)/2, and for one it is 1 - p. If there is no state in the direction the action is biased towards, the agent simply stays in its current state with probability p, and probability 1 - p is equally divided among the available neighboring states. The parameter p is usually global, though there will be a variant of this environment that uses a random probability p for each state-action pair and has selected state-action pairs not biased towards one successor state.
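To make the environment definitions concrete, the following is a minimal sketch (not from the thesis) of how the circle world transition probabilities above could be encoded as a transition tensor P[s, a, s'] in Python with NumPy; the function name and array layout are illustrative choices.

    import numpy as np

    def circle_world(n_states: int, p: float = 0.8) -> np.ndarray:
        """Build the circle world transition tensor P[s, a, s'].

        Action 0 moves one step "clockwise" (i -> i+1, wrapping around),
        action 1 moves one step "counter-clockwise" (i -> i-1, wrapping around).
        With probability p the move succeeds, with probability 1-p the agent stays.
        """
        P = np.zeros((n_states, 2, n_states))
        for i in range(n_states):
            P[i, 0, (i + 1) % n_states] = p      # intended successor of a_1
            P[i, 0, i] = 1.0 - p                 # action fails, agent stays
            P[i, 1, (i - 1) % n_states] = p      # intended successor of a_2
            P[i, 1, i] = 1.0 - p
        return P

    P = circle_world(5)
    assert np.allclose(P.sum(axis=2), 1.0)  # every (s, a) row is a distribution

The grid world and dense world can be built in the same style, only with different successor-state rules.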

2.3 Planning under Uncertainty

Having defined a formal model of the environment, the following section will elaborate more on finding a policy that maximizes return when given an MDP, or solving the MDP. Since the outcome of actions in the environment is stochastic, this is referred to as planning under uncertainty. The methods presented here are cases of dynamic programming, a concept invented by Richard Bellman [2].

First, a more precise formulation of the return in an MDP is needed. When following a fixed policy π and starting in a state s, the expected discounted return or (state) value V^π(s) of a state s is

    V^π(s) = E_π{ r_1 + γ r_2 + γ^2 r_3 + ... | s_0 = s }.    (2.4)

The value enables the agent to assess the desirability of being in a certain state. The value can be reformulated through its recursive property so that it becomes more explicit:

    V^π(s) = E_π{ r_1 | s_0 = s } + γ E_π{ r_2 + γ r_3 + ... | s_0 = s }
           = Σ_{s'} p(s' | s, π(s)) [ R(s, π(s), s') + γ E_π{ r_2 + γ r_3 + ... | s_1 = s' } ]
           = Σ_{s'} p(s' | s, π(s)) [ R(s, π(s), s') + γ V^π(s') ]    (2.5)

or, in case of a stochastic policy, as

    V^π(s) = Σ_a π(a | s) Σ_{s'} p(s' | s, a) [ R(s, a, s') + γ V^π(s') ].    (2.6)

Alternatively, when the agent performs action a in state s and follows a fixed policy π thereafter, the state-action value Q^π(s, a) is given by

    Q^π(s, a) = Σ_{s'} p(s' | s, a) [ R(s, a, s') + γ Q^π(s', π(s')) ]    (2.7)

or, in case of a stochastic policy, as

    Q^π(s, a) = Σ_{s'} p(s' | s, a) [ R(s, a, s') + γ Σ_{a'} π(a' | s') Q^π(s', a') ].    (2.8)

Both values relate states to the expected discounted return by weighted propagation of future rewards through the possible sequences (s_0, a_0, r_1, s_1, a_1, r_2, ...). State value and state-action value are interchangeable, since V^π(s) = Q^π(s, π(s)), or V^π(s) = Σ_a π(a | s) Q^π(s, a) for stochastic policies.

Equations 2.5 and 2.7 are the Bellman equations for the state and state-action value function. Iterating either V^π or Q^π leads to convergence to the value or state-action value function for the corresponding policy π. This application of dynamic programming is called policy evaluation.

Based on the value function, optimality can be easily defined. A policy is optimal if it maximizes the state value in every state:

    ∀ s ∈ S : V^{π*}(s) = V*(s)   where   V*(s) = max_π V^π(s)    (2.9)
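As an illustration of policy evaluation, here is a minimal sketch of my own (not code from the thesis) that iterates the Bellman equation (2.6) for a fixed stochastic policy until the value estimate stops changing; P and R are assumed to be arrays of shape [S, A, S] as in the environment sketch above, and pi an array of shape [S, A].

    import numpy as np

    def policy_evaluation(P, R, pi, gamma=0.95, tol=1e-8):
        """Iterate V(s) <- sum_a pi(a|s) sum_s' P[s,a,s'] (R[s,a,s'] + gamma V(s'))."""
        n_states = P.shape[0]
        V = np.zeros(n_states)
        while True:
            # expected immediate reward plus discounted future value, per (s, a)
            Q = np.einsum('sat,sat->sa', P, R) + gamma * P @ V   # P @ V has shape [S, A]
            V_new = np.einsum('sa,sa->s', pi, Q)                  # average over pi(a|s)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new
            V = V_new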

For every MDP there exists at least one deterministic policy that is optimal with regards to the value. Bellman's principle of optimality makes it possible to find this optimal policy. This principle states that, for any initial state, a policy that selects the action that maximizes the value of the state and from there on out is equal to the optimal policy is optimal. The Bellman optimality equation for the value is

    V*(s) = max_a Σ_{s'} p(s' | s, a) ( R(s, a, s') + γ V*(s') )    (2.10)

with the corresponding optimal policy

    π*(s) = argmax_a Σ_{s'} p(s' | s, a) ( R(s, a, s') + γ V*(s') ).    (2.11)

Iterating the equation

    V_{k+1}(s) = max_a Σ_{s'} p(s' | s, a) ( R(s, a, s') + γ V_k(s') )    (2.12)

for all states is called value iteration and lets V_k converge to V*, which implicitly contains the optimal policy via Equation 2.11. The Bellman optimality equation for the state-action value function is

    Q*(s, a) = Σ_{s'} p(s' | s, a) [ R(s, a, s') + γ max_{a'} Q*(s', a') ]    (2.13)

with the corresponding optimal policy

    π*(s) = argmax_a Q*(s, a).    (2.14)

Repeated application of

    Q_{k+1}(s, a) = Σ_{s'} p(s' | s, a) [ R(s, a, s') + γ max_{a'} Q_k(s', a') ]    (2.15)

for all states and actions is called Q-iteration and converges to Q*(s, a). Q-iteration contains the optimal policy explicitly, since the agent only has to choose the action with the highest state-action value.
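The following is a minimal sketch (an assumption of mine, not code from the thesis) of Q-iteration as in Equation 2.15, reusing the P and R tensors from the previous sketches; the greedy policy of Equation 2.14 is then read off with an argmax.

    import numpy as np

    def q_iteration(P, R, gamma=0.95, tol=1e-8):
        """Iterate Q(s,a) <- sum_s' P[s,a,s'] (R[s,a,s'] + gamma max_a' Q(s',a'))."""
        n_states, n_actions, _ = P.shape
        Q = np.zeros((n_states, n_actions))
        while True:
            Q_new = np.einsum('sat,sat->sa', P, R) + gamma * P @ Q.max(axis=1)
            if np.max(np.abs(Q_new - Q)) < tol:
                return Q_new
            Q = Q_new

    # The optimal deterministic policy (Equation 2.14) is simply:
    # pi_star = q_iteration(P, R).argmax(axis=1)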

There exists another method to solve an MDP. It consists of two steps:

1. Evaluate policy π to obtain Q^π(s, a).
2. Select a new policy π(s) = argmax_a Q^π(s, a).

These two steps are applied in turns until the policy stops changing. This method is called policy iteration.

These planning algorithms enable the agent to perform exploitation easily if the true world model is known. But the agent doesn't actually have this model. The algorithms presented in the following section will show ways of dealing with this situation, or how to learn.

2.4 An Overview of Reinforcement Learning Methods

As explained earlier, planning how to maximize return when the dynamics of the environment are known is not the reinforcement learning problem. Instead, the agent has no knowledge about its environment at the beginning of its task and has to gather it autonomously. The agent has to perform exploration. The difficulty is that the two things the agent has to do - exploration and exploitation - are somewhat opposed: at any given time, the agent can either choose to try to increase its knowledge or to use it to obtain reward. Some way of dealing with this so-called exploration-exploitation tradeoff is needed.

In reinforcement learning, there are three basic methods of learning. Model-free learning, where some value function is approximated, is represented here through temporal difference learning and introduced first. Second, there are model-based algorithms which try to efficiently learn a model of the system and then solve this model through dynamic programming. These are represented here through the RMAX approach. Finally, another model-based solution is introduced: Bayesian reinforcement learning, in which a Bayesian optimal policy is approximated. The last learning method is policy search, where a policy is directly learned from the data, e.g. through policy gradients. This group of methods will not be introduced here. Before these methods are introduced, several measures of complexity for reinforcement learning algorithms will be defined, along with a framework for assessing efficiency.

2.4.1 Sample complexity and PAC-MDP

There are three relevant measures of complexity in reinforcement learning. Computational complexity is the amount of time the agent needs to perform the required computations for each time step. Space complexity is the amount of memory the agent needs to store the required information for its computations; for example, an agent would need at least |S| · |A| entries to store the state-action value function. The third quantity is more complex.

Definition 2.2 (Sample complexity) Let c = (s_0, a_0, r_1, s_1, a_1, ..., a_{t-1}, r_t, s_t) be a random path generated by executing an algorithm A in an MDP M. For any fixed ε > 0, the sample complexity of exploration of A is the number of timesteps t such that the policy at time t, A_t, satisfies V^{A_t}(s_t) < V*(s_t) - ε.

Sample complexity [8] is the number of timesteps for which the algorithm's, or the agent's, return is more than ε worse than the optimal return. Put differently, it is the number of samples needed for the agent to perform sufficiently well. Sample complexity is rather important, since in real reinforcement learning settings, sampling can take a lot of time. Based on these three complexity measures, Strehl et al. [18] introduced the PAC-MDP concept.

Definition 2.3 (PAC-MDP) An algorithm A is said to be an efficient PAC-MDP (Probably Approximately Correct in Markov Decision Processes) algorithm if, for any ε > 0 and 0 < δ < 1, the per-timestep computational complexity, space complexity, and the sample complexity of A are less than some polynomial in the relevant quantities (|S|, |A|, 1/ε, 1/δ, 1/(1 - γ)), with probability at least 1 - δ. It is simply PAC-MDP if the definition is relaxed to have no computational complexity requirement.

An efficient PAC-MDP algorithm thus performs sub-optimally only in a number of time steps that is polynomial in the mentioned quantities. PAC-MDP is an important formal framework for efficiency in reinforcement learning. It is also used to derive upper and lower bounds for these algorithms. There are model-free as well as model-based efficient PAC-MDP algorithms.

2.4.2 Temporal Difference Learning

Temporal difference learning was introduced by Sutton in 1988 [21]. In this form of learning, no model of the dynamics of the environment is maintained, which is why these methods are called model-free. Instead, a value function is learned. Specifically, temporal difference methods iteratively improve an approximation of the value function using a previous estimate of it, a concept called bootstrapping.

In the simplest form, TD(0), the approximation is updated according to the update step

    V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) - V(s_t) ].    (2.16)

This can be done online after every experience and converges to an optimal estimation of the value function in the sense that it would be correct for the maximum-likelihood model of the corresponding Markov process. The value function approximated is, of course, the value function of the policy that the agent follows. To get to an algorithm that learns the optimal value function V* and is therefore able to successfully exploit, some modifications have to be made. Two such modifications shall be discussed here: SARSA and Q-learning.

Algorithm 2.1: SARSA
  Initialize Q(s, a) arbitrarily
  Draw start state s_0
  Choose action a_0 ←(ε-greedy) argmax_a Q(s_0, a)
  for t = 0, 1, 2, 3, ... do
      Execute a_t, observe r_{t+1}, s_{t+1}
      Choose action a_{t+1} ←(ε-greedy) argmax_a Q(s_{t+1}, a)
      Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) ]
  end

SARSA [14] is an on-policy temporal difference learning algorithm. On-policy means that, similar to policy iteration, it follows a policy π and approximates the corresponding state-action value function Q^π(s, a) while simultaneously optimizing the policy with respect to Q^π. Algorithm 2.1 shows a concrete version of SARSA with ε-greedy action selection. A 5-tuple (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}) - hence the name SARSA - is generated by the agent by selecting action a_t in the current state s_t, obtaining some reward r_{t+1}, observing the follow-up state s_{t+1} and choosing another action a_{t+1}. Actions are selected with a method that incorporates some exploration mechanism, explained in detail later. The sample is then used to update the estimate of the state-action value function according to

    Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) ].    (2.17)

Note that the tuples used in adjacent updating steps overlap, meaning that the first action for the next step is already selected in the current step, before the state-action value function is updated. This is why SARSA is an on-policy algorithm.
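As a concrete illustration, here is a minimal Python sketch of Algorithm 2.1 (my own rendering, not code from the thesis); the `env.reset`/`env.step` interface returning the next state and reward is an assumption, and no transition model is ever used.

    import numpy as np

    def sarsa(env, n_states, n_actions, steps=10000,
              alpha=0.1, gamma=0.95, epsilon=0.1, rng=np.random.default_rng(0)):
        """Tabular SARSA with epsilon-greedy action selection (Algorithm 2.1)."""
        Q = np.zeros((n_states, n_actions))

        def eps_greedy(s):
            if rng.random() < epsilon:
                return int(rng.integers(n_actions))   # explore
            return int(np.argmax(Q[s]))               # exploit

        s = env.reset()
        a = eps_greedy(s)
        for _ in range(steps):
            s_next, r = env.step(a)                   # execute a_t, observe r_{t+1}, s_{t+1}
            a_next = eps_greedy(s_next)               # on-policy: next action chosen now
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
            s, a = s_next, a_next
        return Q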

Sutton and Barto mention in their book that convergence to an optimal function is ensured if all state-action pairs are sampled an infinite number of times and the policy converges to a greedy policy; this will be referred to later.

Q-learning, on the other hand, is an off-policy temporal difference learning algorithm introduced by Watkins in 1989 [25]. It directly approximates the optimal state-action value function Q*(s, a). A concrete version is formulated in Algorithm 2.2, again with ε-greedy action selection.

Algorithm 2.2: Q-Learning
  Initialize Q(s, a) arbitrarily
  Draw start state s_0
  for t = 0, 1, 2, 3, ... do
      Choose action a_t ←(ε-greedy) argmax_a Q(s_t, a)
      Execute a_t, observe r_{t+1}, s_{t+1}
      Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t) ]
  end

The difference in the value update equation

    Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t) ]    (2.18)

in comparison to SARSA is that the follow-up action a_{t+1} = argmax_a Q(s_{t+1}, a) is not chosen according to a policy but greedily, to maximize the future value; it is chosen off-policy. In this regard, it is closer to value iteration and follows the notion of Bellman's principle of optimality - if an action maximizing the value in the current state is chosen and the agent follows an optimal policy from there on out, the agent maximizes its return. In 1992, convergence to the optimal value function was proven [26].

Both algorithms address the exploration-exploitation tradeoff through the action selection mechanism they use. In the algorithm instances presented here, ε-greedy action selection is used, where with probability 1 - ε the action a = argmax_a Q(s, a) that maximizes the state-action value function is chosen, and a random other action with probability ε, with 0 ≤ ε ≤ 1. This ensures exploration in a very simple way, and since every action is sampled infinitely often, it helps to prove convergence to the optimal value function. Yet, of course, the agent never maximizes its return, since it does not always take the optimal action. The parameter ε determines how strongly exploration is valued over exploitation.

For example, if ε = 0.1, the algorithm will converge faster than if ε = 0.01, but after some time, the return will be higher for the latter, since the maximizing action is chosen more frequently. For ε = 1, the agent chooses actions randomly, while for ε = 0, the agent never takes a non-optimal action; it acts greedily all the time. Furthermore, for SARSA, the policy has to converge to a greedy policy to ensure convergence to the optimal policy, so ε has to decrease over time, e.g. ε = 1/t. The problem with this modification is that some knowledge is needed in advance to determine how fast ε should decrease. For a very large state space S or action space A, it should obviously decrease slower than if there were only two states and two actions, for example.

Another method is softmax or Boltzmann action selection. The action a for a state s with corresponding state-action value function Q(s, a) is selected according to a Boltzmann distribution, and thus

    π(a | s) = exp(Q(s, a)/τ) / Σ_{a'} exp(Q(s, a')/τ)    (2.19)

with temperature τ. For τ → ∞, softmax action selection is equivalent to random or uniform action selection; for τ → 0, it is equivalent to greedy action selection. This selection method does not randomly select one non-optimal action with a certain probability, but ranks the actions according to their respective value. Again, the parameter τ can be chosen as a function of time. According to Sutton and Barto, there are no comparative studies about which type of action selection to prefer with regard to performance, but ε-greedy is more common because of its more intuitive parameter setting.

While the computational and space complexity of these model-free approaches is very low compared to the algorithms introduced below, they have a high sample complexity. Furthermore, an algorithm that uses ε-greedy exploration can never be an efficient PAC-MDP algorithm, since its sample complexity is exponential in the number of states [27]. Efficient PAC-MDP model-free algorithms exist, though, for example delayed Q-learning [19].

A general problem of model-free approaches is that they learn the value function for one task and one task only. They cannot apply the knowledge gained from one task to an environment with the same dynamics but a different reward function. Of course they adapt, but they would have to learn a whole new value function instead of just computing it from the known model, as model-based algorithms could. Furthermore, one could argue that they gain no actual knowledge of the world, just about the goodness of actions in states given the task. The value function contains no information about where an action leads the agent or with what probability.
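The two action selection rules discussed above can be written down in a few lines; this is a minimal sketch of my own (not from the thesis), operating on one row Q[s] of a tabular state-action value function.

    import numpy as np

    rng = np.random.default_rng(0)

    def epsilon_greedy(q_s: np.ndarray, epsilon: float) -> int:
        """With probability epsilon pick a uniformly random action, else the greedy one."""
        if rng.random() < epsilon:
            return int(rng.integers(len(q_s)))
        return int(np.argmax(q_s))

    def boltzmann(q_s: np.ndarray, tau: float) -> int:
        """Sample an action from the Boltzmann distribution of Equation 2.19."""
        prefs = q_s / tau
        prefs -= prefs.max()                      # subtract the max for numerical stability
        probs = np.exp(prefs) / np.exp(prefs).sum()
        return int(rng.choice(len(q_s), p=probs))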

2.4.3 Model-Based Learning

In this section, the model-based algorithm RMAX will be introduced as a model-based efficient PAC-MDP algorithm. RMAX was first published in 2001 [4], but the PAC-MDP modification shown in Algorithm 2.3 is from Strehl et al. [18]. RMAX is a manifestation of the principle of optimism in the face of uncertainty. Put very simply, every state-action pair that has not been sampled enough times is considered good and therefore rewarded with maximum reward, hence the name RMAX. This is how RMAX encourages exploration. From the samples, a model is built and used to plan in order to find an optimal policy. To understand why such a simple principle is considered a good algorithm, a closer look is needed.

An RMAX agent starts with an optimistic initial state-action value function U(s, a) that is constant for all state-action pairs and guaranteed to be an upper bound of the true value function. The agent's sample counters n(s, a, s'), n(s, a) = Σ_{s'} n(s, a, s') and r(s, a, s') (the sum of rewards obtained for performing a in s and reaching s') are set to zero. It then starts interacting with the environment and updates its counters. As soon as it has seen any state-action pair (s, a) m times, it uses the counters to estimate the transition probabilities

    p̂(s' | s, a) = n(s, a, s') / n(s, a)   for all s' ∈ S

and the reward function

    R̂(s, a, s') = r(s, a, s') / n(s, a, s')   for all s' ∈ S

with maximum likelihood. The partial model made up from these estimations is then used to plan, in this case via value iteration. Note that the state-action value is only updated for known pairs and only if a new pair becomes known, yet the value of unknown pairs can be part of the value update as future value. This is important, since no exploration would happen otherwise. As mentioned before, RMAX is an efficient PAC-MDP algorithm if the right values for the parameters c and m are chosen. For a full analysis, the reader is referred to the original papers.

Algorithm 2.3: R-MAX
  foreach (s, a) ∈ S × A do
      Q(s, a) ← U(s, a)
      n(s, a) ← 0
      foreach s' ∈ S do
          n(s, a, s') ← 0
          r(s, a, s') ← 0
      end
  end
  Draw start state s_0
  for t = 0, 1, 2, 3, ... do
      Choose action a_t ←(ε-greedy) argmax_a Q(s_t, a)
      Execute a_t, observe r_{t+1}, s_{t+1}
      if n(s_t, a_t) < m then
          n(s_t, a_t) ← n(s_t, a_t) + 1
          r(s_t, a_t, s_{t+1}) ← r(s_t, a_t, s_{t+1}) + r_{t+1}
          n(s_t, a_t, s_{t+1}) ← n(s_t, a_t, s_{t+1}) + 1
          if n(s_t, a_t) = m then
              for i = 1, 2, 3, ..., c do
                  foreach (s̄, ā) with n(s̄, ā) ≥ m do
                      Q(s̄, ā) ← Σ_{s'} p̂(s' | s̄, ā) [ R̂(s̄, ā, s') + γ max_{a'} Q(s', a') ]
                  end
              end
          end
      end
  end
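To make the model-estimation step concrete, here is a small Python sketch of my own (not from the thesis) that turns the RMAX counters into the maximum-likelihood estimates p̂ and R̂ used in the planning step; the array names and shapes are illustrative.

    import numpy as np

    def estimate_model(n_sas, r_sas, m):
        """Maximum-likelihood model from RMAX counters.

        n_sas[s, a, s'] counts observed transitions, r_sas[s, a, s'] accumulates rewards.
        Only state-action pairs with n(s, a) >= m are considered "known"; for the
        others the estimates stay at zero and the optimistic value U(s, a) is kept.
        """
        n_sa = n_sas.sum(axis=2)                              # n(s, a)
        known = n_sa >= m                                     # boolean mask over (s, a)
        p_hat = np.zeros_like(n_sas, dtype=float)
        r_hat = np.zeros_like(r_sas, dtype=float)
        p_hat[known] = n_sas[known] / n_sa[known][:, None]    # n(s,a,s') / n(s,a)
        seen = np.nonzero(n_sas)
        r_hat[seen] = r_sas[seen] / n_sas[seen]               # r(s,a,s') / n(s,a,s')
        return p_hat, r_hat, known

The planning step on the known pairs then reuses Q-iteration from Section 2.3 on (p_hat, r_hat).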

E³ (Explicit Explore or Exploit), another model-based efficient PAC-MDP algorithm that uses the optimism in the face of uncertainty principle, was published by Kearns and Singh in 2002 [9]. An E³ agent maintains two models. One is the maximum likelihood estimate M_known including all known states. The other one, M_unknown, consists of all the known states with the same dynamics but zero reward for known states, and a fictitious state to which all unknown transitions lead and which has maximum reward. When in a known state, both MDPs are solved. If the value of the policy resulting from planning in M_known is high enough, the agent follows this policy; it exploits. Else, it follows the policy derived from solving M_unknown, resulting in planned exploration. If the state the agent is in is unknown, it performs the action it has performed the fewest times.

The basic concepts to take away from this section are optimism in the face of uncertainty and PAC-MDP efficiency. The agent believes that actions it doesn't know are actions that will lead to reward, and under this assumption it is able to perform approximately optimally with a certain probability of error in polynomial time with regards to the required proximity to optimality, the accepted probability of error, and the parameters of the model of the environment. This is important knowledge, although the number of samples needed to achieve sufficient proximity to optimal behavior with a satisfactory probability is very high.

2.4.4 Bayesian Reinforcement Learning

Bayesian reinforcement learning is a solution to the exploration-exploitation tradeoff that is different from the ones introduced above. In the Bayesian reinforcement learning setting, the uncertainty about the model is explicitly modeled - the agent maintains a belief b over the model. This belief is incorporated into the value function, which leads to the Bellman equation for the Bayesian state value

    V^π(b, s) = Σ_{b', s'} p(b', s' | s, a, b) [ R(s, a, s') + γ V^π(b', s') ]    (2.20)

where a = π(b, s). In a discrete environment, the belief can be easily represented through a set of Dirichlet distributions

    b = {α(s, a, s')},   p(s' | b, s, a) = α(s, a, s') / α_0(s, a)

where α(s, a, s') is simply a counter for the number of samples (s_t = s, a_t = a, s_{t+1} = s') and α_0(s, a) = Σ_{s'} α(s, a, s'). The counters are initialized so that they represent a prior over the model. For every experience, the agent then increments the corresponding counter and thus obtains a new belief over the model. Under these assumptions, Equation 2.20 can be simplified to

    V^π(b, s) = Σ_{s'} p(s' | b, s, a) [ R(s, a, s') + γ V^π(b', s') ]    (2.21)

because the new belief b' follows deterministically from the current one and the experience, for the given belief update rule. Bellman's optimality equation for the Bayesian value follows from Equation 2.21 by selecting the actions maximizing the value and is

    V*(b, s) = max_a { Σ_{s'} p(s' | b, s, a) [ R(s, a, s') + γ V*(b', s') ] }    (2.22)

with the Bayesian optimal policy π* = argmax_π V^π(b, s). Using the Bayesian value function to guide the agent leads to optimal behavior with regards to the prior over the model. So, rather than ensuring exploration through action selection or through rewards given for unknown state-action pairs, Bayesian reinforcement learning explicitly includes the agent's uncertainty over its model directly in the formulation of the expected return. This inclusion naturally induces exploration. If a new experience for a tuple (s, a) might lead to a significantly different model with a higher expected return, that action will be selected.

The problem with the Bayesian approach is that, in general, it is not tractable. There are various methods to approximate the optimal Bayesian policy or value function ([7], [6], [20], [13]); the approach presented here is chosen because it provides formal guarantees similar to PAC-MDP. The Bayesian Exploration Bonus (BEB) algorithm introduced by Kolter and Ng [10] defines the optimal value Ṽ*_H(b, s) over the next H time steps as

    Ṽ*_H(b, s) = max_a { R(s, a) + β / (1 + α_0(s, a)) + Σ_{s'} p(s' | b, s, a) Ṽ*_{H-1}(b, s') }    (2.23)

where β / (1 + α_0(s, a)) is the Bayesian exploration bonus. The parameter β is of importance for the bounds presented later. Since this equation does not use the updated belief b', standard dynamic programming can be applied to solve it. The less a state-action pair has been sampled, the higher the bonus. It is assumed that the reward function is known in advance, yet this does not affect generality, since every MDP with an unknown bounded reward function can be remodeled into an MDP with known reward by adding states to it. Kolter and Ng use the finite horizon case because their theorems build on the parameter H. The extension to the infinite horizon setting is an open question, but nevertheless this approach introduces an interesting solution to the exploration-exploitation tradeoff.
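As a small illustration of the BEB idea, the following sketch (mine, not from the thesis or from Kolter and Ng) performs the finite-horizon backups of Equation 2.23 from Dirichlet counts, keeping the belief fixed during planning; the known reward R_sa, the count array alpha, and the optional discount are assumptions of the sketch, and H is assumed to be at least 1.

    import numpy as np

    def beb_backup(alpha, R_sa, H, beta, gamma=1.0):
        """Finite-horizon dynamic programming for Eq. 2.23 (H >= 1).

        alpha[s, a, s'] are Dirichlet counts (the belief b), R_sa[s, a] is the known
        mean reward. Holding the belief fixed is what makes standard dynamic
        programming applicable here.
        """
        alpha0 = alpha.sum(axis=2)                       # alpha_0(s, a)
        p = alpha / alpha0[:, :, None]                   # p(s' | b, s, a)
        bonus = beta / (1.0 + alpha0)                    # Bayesian exploration bonus
        V = np.zeros(alpha.shape[0])                     # V_0 = 0
        for _ in range(H):
            Q = R_sa + bonus + gamma * (p @ V)           # backup over one more step
            V = Q.max(axis=1)
        return Q, V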

Using this approximation, Kolter and Ng provide the following bound:

Theorem 2.1 Let A_t denote the policy followed by the BEB algorithm (with β = 2H^2) at time t, and let s_t and b_t be the corresponding state and belief. Also suppose we stop updating the belief for a state-action pair when α_0(s, a) > 4H^3/ε. Then with probability at least 1 - δ,

    V^{A_t}_H(b_t, s_t) ≥ V*_H(b_t, s_t) - ε

for all but

    m = O( (|S| |A| H^6 / ε^2) log(|S| |A| / δ) )

time steps.

So, similar to PAC-MDP, a BEB agent is guaranteed to act sub-optimally only for a polynomial number of time steps. In fact, the bound is tighter than the PAC-MDP sample complexity

    m = Õ( |S|^2 |A| H^6 / ε^3 ).

These bounds are with regard to closeness to two different optimal value functions, though: the optimal Bayesian value function and the optimal value function for some given model. Bayes optimality requires less exploration because, for a sufficiently certain transition probability, the expected return does not change significantly when the model is updated. This intuition is formalized by the following theorem:

Theorem 2.2 Let A_t denote the policy followed by an algorithm using any exploration bonus that is upper bounded by β / n(s, a)^p for some constant β and p > 1/2. Then there exists some MDP M and ε_0(β, p) such that, with probability greater than δ_0 = 0.15,

    V^{A_t}_H(s_t) < V*_H(s_t) - ε_0

will hold for an unbounded number of time steps.

In other words, any algorithm with an exploration bonus that decays faster than 1/√n, such as BEB, cannot be PAC-MDP and may not find an optimal policy with regards to the state value.

Bayesian reinforcement learning in general introduces the uncertainty over the model (or over the models in other Bayesian approaches) explicitly into the prediction of return. The BEB algorithm introduced here has a lower sample complexity than the most efficient PAC-MDP algorithm.

This stems from the smaller amount of exploration needed to be close to Bayes optimality. It remains to be seen if it can be applied to the infinite horizon case.

3 Shannon's Information Theory

When Shannon defined entropy in 1948, it was part of his attempt to establish A Mathematical Theory of Communication [15]. This theory was supposed to find the boundaries of information compression and of the transmission of such information. Therefore, he was looking for a measure of the rate at which an information source produces information. More generally, the quantity was supposed to describe the uncertainty over the occurrence of one out of several possible events, or the inherent complexity of the process underlying these occurrences. It turned out that Shannon's work was the key concept for a theory that has far more possible fields of application than just communication theory - information theory. Cover and Thomas [5] give an extensive overview of the applications of Shannon's information theory, which range from the original field of communication theory to computer science (where entropy is approximately equal to Kolmogorov complexity, the minimal description length of a data sequence), statistics and economics. This work draws heavily on Cover's work, and the reader is referred to it for proofs and deeper insights into the field of information theory.

In the remainder of this section, the fundamental quantities of Shannon's information theory and their properties will be introduced, followed by an analysis of Markov processes from the perspective of this theory for use in later sections of this work. From here on out, the term information theory will refer to the theory built around Shannon's entropy concept and its properties.

3.1 Fundamental Quantities of Shannon's Information Theory

At the core of information theory lies the entropy H(X) of a discrete random variable X with probability distribution p(x).

Definition 3.1 The entropy H(X) of a discrete random variable X is defined by

    H(X) = - Σ_x p(x) log p(x).    (3.1)

It is common to use the convention 0 log 0 = 0. Since entropy was first introduced in communication theory, it is usually expressed in bits and the logarithm is to the base of 2.

It can be interpreted as the average number of bits needed to describe the random variable, a measure of its uncertainty, or the expected information gained by knowing its value. The entropy of a random variable and its distribution are the same, H(X) = H(p(x)).

Figure 3.1: The entropy H(X) of the Bernoulli distribution as a function of p. It is 0 for p = 1 or p = 0 and maximal for p = 0.5.

Example 3.1 Consider a random variable X with a Bernoulli distribution, that is,

    X = 1 with probability p
        0 with probability 1 - p

The plot of H(X) is shown in Figure 3.1. For p = 0 or 1, the entropy is 0 - the value is known in advance, there is no uncertainty. For p = 1 - p = 0.5, the entropy is maximal - the value can only be guessed, there is nothing but uncertainty.

As can be expected from the various interpretations, the entropy is never negative - no information can be lost through knowing the outcome of an event.

Lemma 3.1 H(X) ≥ 0.

This follows easily from 0 ≤ p(x) ≤ 1 ⇒ -log p(x) ≥ 0 and the fact that a weighted sum of non-negative values will always be non-negative. The entropy is 0 when there is no uncertainty, i.e. ∃ x̄ ∈ X : p(X = x̄) = 1. Since p(x) is a probability distribution, it follows that the probability of all other values is 0, and the entropy is H(X) = -1 · log 1 = 0.
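Example 3.1 can be checked numerically with a few lines of Python; this small sketch (mine, not from the thesis) evaluates the binary entropy over a grid of p values.

    import numpy as np

    def binary_entropy(p: np.ndarray) -> np.ndarray:
        """H(X) in bits for a Bernoulli(p) variable, using the convention 0 log 0 = 0."""
        q = np.stack([p, 1.0 - p])
        with np.errstate(divide='ignore', invalid='ignore'):
            terms = np.where(q > 0, q * np.log2(q), 0.0)
        return -terms.sum(axis=0)

    ps = np.linspace(0.0, 1.0, 101)
    H = binary_entropy(ps)
    print(H[0], H[50], H[100])   # 0.0 at p=0, 1.0 bit at p=0.5, 0.0 at p=1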

Definition 3.1 can be easily extended to the entropy of a joint distribution.

Definition 3.2 The joint entropy H(X, Y) of two discrete random variables X, Y is defined by

    H(X, Y) = - Σ_{x,y} p(x, y) log p(x, y)    (3.2)

The entropy of a conditional distribution comes similarly naturally.

Definition 3.3 The conditional entropy H(X | Y) of two discrete random variables X, Y is defined by

    H(X | Y) = - Σ_{x,y} p(x, y) log ( p(x, y) / p(y) )    (3.3)

The measures introduced up until now allow one to describe random variables in terms of the information they contain. But how can different distributions over a random variable be compared?

Definition 3.4 The Kullback-Leibler divergence D_KL between two probability mass functions p(x) and q(x) is defined as

    D_KL(p(x) ‖ q(x)) = Σ_x p(x) log ( p(x) / q(x) )    (3.4)

The Kullback-Leibler divergence is similar to a distance between two distributions - it describes the amount of information that is gained by knowing the true distribution p of a random variable instead of assuming its distribution to be q. It is no true distance in the mathematical sense, since it does not satisfy the triangle inequality. Furthermore, it can be used to measure the information a random variable contains about another random variable, their mutual information.

Definition 3.5 The mutual information I(X; Y) between two random variables X and Y is defined as the Kullback-Leibler divergence between the corresponding joint distribution p(x, y) and the product of the respective distributions p(x) and p(y), and is thus

    I(X; Y) = D_KL(p(x, y) ‖ p(x)p(y)) = Σ_{x,y} p(x, y) log ( p(x, y) / (p(x)p(y)) )    (3.5)

In other words, mutual information measures how much knowing the value of one random variable reduces the uncertainty about another random variable. Mutual information is symmetric and can be reformulated as

    I(X; Y) = H(X) - H(X | Y) = H(Y) - H(Y | X)    (3.6)

Lastly, there are two theorems about the quantities introduced above that will be important later in this work. These properties can be derived through the use of Jensen's inequality.

Theorem 3.1 (Jensen's inequality) If f is a convex function and X a random variable, then

    E[f(X)] ≥ f(E[X]).    (3.7)

From Jensen's inequality and the fact that x log x is convex for x ≥ 0 follows Gibbs' inequality.

Theorem 3.2 (Gibbs' inequality) Let p(x) and q(x) be two probability mass functions. Then

    D_KL(p ‖ q) ≥ 0    (3.8)

with equality if and only if p(x) = q(x) for all x.

Since mutual information can be formulated as a Kullback-Leibler divergence, it is never negative, and it is 0 when X and Y are independent, i.e. p(x)p(y) = p(x, y). The next theorem establishes the upper bound of the entropy and the maximum entropy distribution.

Theorem 3.3 H(X) ≤ log |X| with equality if and only if X is uniformly distributed.

|X| denotes the cardinality of the sample space of X. For any distribution p(x) and the uniform distribution u(x) = 1/|X|, the Kullback-Leibler divergence is

    D_KL(p ‖ u) = Σ_x p(x) log ( p(x) / u(x) ) = log |X| - H(X).    (3.9)

Taking into account Gibbs' inequality, the uniform distribution has maximum entropy and is the only such distribution. To sum up the boundaries of entropy and Kullback-Leibler divergence:

    0 ≤ H(X) ≤ log |X|,   0 ≤ D_KL(p ‖ q),

with D_KL(p ‖ u) ≤ log |X| in particular for the uniform distribution u.
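These quantities are straightforward to compute for discrete distributions; the following is a small sketch of my own (not from the thesis) for entropy, Kullback-Leibler divergence and mutual information, all in bits.

    import numpy as np

    def entropy(p):
        """H(p) in bits, with the convention 0 log 0 = 0."""
        p = np.asarray(p, dtype=float)
        nz = p > 0
        return -np.sum(p[nz] * np.log2(p[nz]))

    def kl_divergence(p, q):
        """D_KL(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        nz = p > 0
        return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

    def mutual_information(p_xy):
        """I(X;Y) = D_KL(p(x,y) || p(x)p(y)) for a joint distribution given as a matrix."""
        p_xy = np.asarray(p_xy, dtype=float)
        p_x = p_xy.sum(axis=1, keepdims=True)
        p_y = p_xy.sum(axis=0, keepdims=True)
        return kl_divergence(p_xy.ravel(), (p_x * p_y).ravel())

    # Sanity check against Theorem 3.3: the uniform distribution maximizes entropy.
    u = np.full(8, 1 / 8)
    assert np.isclose(entropy(u), np.log2(8))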

The remainder of this section will analyse the properties of Markov processes from the standpoint of information theory.

3.2 Information Theoretic Properties of Stochastic Processes

The Markov decision process as a model for reinforcement learning was introduced in Section 2.2. It was also mentioned that for a fixed policy π, an MDP reduces to a Markov chain. A key property of such Markov chains is the stationary state distribution. This distribution and its requirements will make up the first part of this section, followed by an analysis based on the information theoretic quantities introduced above.

Markov chains are stochastic processes (sequences of random variables) defined by their state transition probabilities p(x_{t+1} | x_t). As can be easily guessed from their name, Markov chains possess the Markov property (see Equation 2.3). The state transition probabilities are time invariant. The probability distribution p_t(x) is the distribution over the states of a Markov chain at time t. This distribution evolves over time according to

    p_{t+1}(x') = Σ_x p_t(x) p(x' | x).    (3.10)

The state distribution p_0(x) is referred to as the starting distribution.

Definition 3.6 (Stationary distribution) The state distribution of a Markov chain is called a stationary distribution µ if

    µ = µP    (3.11)

where P is the transition matrix. In other words, the stationary distribution does not change over time. There are two conditions a Markov chain needs to satisfy for it to have a unique stationary distribution.

Definition 3.7 A Markov chain is said to be irreducible if from any state of the chain every other state is reachable in finite time with positive probability, that is,

    ∃ n > 0 : p(X_n = j | X_0 = i) > 0   for all i, j    (3.12)

Definition 3.8 The period k of a state x of a Markov chain is defined as

    k = gcd { n > 0 : p(X_n = x | X_0 = x) > 0 }.    (3.13)

If its period is k = 1, a state is called aperiodic. Else, the state is called periodic. If an irreducible Markov chain has one aperiodic state, all of its states are aperiodic, and the Markov chain is called aperiodic.

Every irreducible and aperiodic Markov chain has a unique stationary distribution and converges to this stationary distribution for arbitrary starting distributions.

If the starting distribution of a Markov chain is stationary, the Markov chain is stationary. But since the state distribution of every irreducible and aperiodic Markov chain converges to a stationary distribution, every such Markov chain becomes stationary for t → ∞. Now that the stationary distribution has been defined, another quantity of Markov chains can be introduced.

Definition 3.9 The entropy rate H(X) of a stochastic process is defined as

    H(X) = lim_{t→∞} (1/t) H(X_1, X_2, ..., X_t)    (3.14)

when the limit exists. The entropy rate describes how the entropy of the sequence grows over time. This is closely related to

    H'(X) = lim_{t→∞} H(X_t | X_{t-1}, X_{t-2}, ..., X_1).    (3.15)

For stationary Markov chains, the entropy rates H(X) and H'(X) are equal and can be easily calculated.

Theorem 3.4 For a stationary Markov chain, the entropy rate is

    H(X) = - Σ_{x_t, x_{t+1}} µ(x_t) p(x_{t+1} | x_t) log p(x_{t+1} | x_t) = H(X_2 | X_1).    (3.16)

The last equality follows from the Markov property of the Markov chain. In other words, the entropy rate is the information contained in the transition. If the transition probabilities and the last state of the Markov chain are known, the entropy rate describes the amount of uncertainty regarding the next state. From here on out, the entropy rate will be denoted by H(X_{t+1} | X_t).
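The stationary distribution and the entropy rate of Theorem 3.4 can be computed directly from a transition matrix; this is a minimal sketch of my own (not from the thesis), using power iteration for µ.

    import numpy as np

    def stationary_distribution(P, iters=10000):
        """Power iteration mu <- mu P for a row-stochastic transition matrix P."""
        mu = np.full(P.shape[0], 1.0 / P.shape[0])   # start from the uniform distribution
        for _ in range(iters):
            mu = mu @ P
        return mu

    def entropy_rate(P):
        """H(X_{t+1} | X_t) = - sum_{x,x'} mu(x) P[x,x'] log2 P[x,x'] (Theorem 3.4)."""
        mu = stationary_distribution(P)
        with np.errstate(divide='ignore', invalid='ignore'):
            logP = np.where(P > 0, np.log2(P), 0.0)
        return -np.sum(mu[:, None] * P * logP)

    # Example: a two-state chain that mostly stays put has a low entropy rate.
    P = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
    print(stationary_distribution(P))   # approximately [0.5, 0.5]
    print(entropy_rate(P))              # approximately 0.469 bits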

4 Contrasting Different Notions of Explorative Behaviour

This section will elaborate more on the role of exploration in classic reinforcement learning, introduce several approaches that use information theory to drive exploration, and compare the different notions of exploration.

4.1 Exploration in Reinforcement Learning

Section 2 introduced the classic reinforcement learning problem and several approaches to solve it. Underlying all these approaches is the concept of some optimal behavior to be achieved. Optimality is defined with regards to a value function that embodies an expected future return. Since the agent acts in an unknown environment, it has to perform exploration to find out what constitutes optimal behavior in this environment. During exploration, it usually performs non-optimally. Therefore, the time used for exploration has to be minimized and used efficiently to gain enough knowledge of the environment dynamics. Whether the agent directly approximates the value function via temporal difference learning, learns a model it can use to plan via RMAX or E³, or tries to optimally balance exploration and exploitation with regards to some prior about the model via Bayesian reinforcement learning does not matter: the outcome is always supposed to be optimal behavior. Frameworks such as PAC-MDP or the bounds presented by Kolter and Ng give guarantees that the time spent performing non-optimally is limited with some probability. Thus, exploration can be seen as something that has to be done to enable the agent to do what it is supposed to do, i.e. accumulate as much reward as possible. This is referred to as the exploration-exploitation tradeoff. Exploration can be induced either through selecting non-optimal actions with a certain probability, e.g. ε-greedy or Boltzmann action selection, through the optimism in the face of uncertainty concept as embodied by RMAX or E³, or by incorporating the information gain into the expected return through the Bayesian formulation of the value.

All of these approaches assume a stationary, task-dependent reward function. When approximating a value function, as in temporal difference learning, the approximation converges over time, and after a certain amount of sampling the policy is, neglecting the noisy action selection, also fixed and optimal. With optimism in the face of uncertainty, exploration can and does end


More information

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam:

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam: Marks } Assignment 1: Should be out this weekend } All are marked, I m trying to tally them and perhaps add bonus points } Mid-term: Before the last lecture } Mid-term deferred exam: } This Saturday, 9am-10.30am,

More information

Reinforcement Learning: the basics

Reinforcement Learning: the basics Reinforcement Learning: the basics Olivier Sigaud Université Pierre et Marie Curie, PARIS 6 http://people.isir.upmc.fr/sigaud August 6, 2012 1 / 46 Introduction Action selection/planning Learning by trial-and-error

More information

Reinforcement Learning

Reinforcement Learning 1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision

More information

Lecture 25: Learning 4. Victor R. Lesser. CMPSCI 683 Fall 2010

Lecture 25: Learning 4. Victor R. Lesser. CMPSCI 683 Fall 2010 Lecture 25: Learning 4 Victor R. Lesser CMPSCI 683 Fall 2010 Final Exam Information Final EXAM on Th 12/16 at 4:00pm in Lederle Grad Res Ctr Rm A301 2 Hours but obviously you can leave early! Open Book

More information

Lecture 3: Markov Decision Processes

Lecture 3: Markov Decision Processes Lecture 3: Markov Decision Processes Joseph Modayil 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs Markov Processes Introduction Introduction to MDPs Markov

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Ron Parr CompSci 7 Department of Computer Science Duke University With thanks to Kris Hauser for some content RL Highlights Everybody likes to learn from experience Use ML techniques

More information

Reinforcement Learning. Introduction

Reinforcement Learning. Introduction Reinforcement Learning Introduction Reinforcement Learning Agent interacts and learns from a stochastic environment Science of sequential decision making Many faces of reinforcement learning Optimal control

More information

Reinforcement Learning with Function Approximation. Joseph Christian G. Noel

Reinforcement Learning with Function Approximation. Joseph Christian G. Noel Reinforcement Learning with Function Approximation Joseph Christian G. Noel November 2011 Abstract Reinforcement learning (RL) is a key problem in the field of Artificial Intelligence. The main goal is

More information

Temporal Difference. Learning KENNETH TRAN. Principal Research Engineer, MSR AI

Temporal Difference. Learning KENNETH TRAN. Principal Research Engineer, MSR AI Temporal Difference Learning KENNETH TRAN Principal Research Engineer, MSR AI Temporal Difference Learning Policy Evaluation Intro to model-free learning Monte Carlo Learning Temporal Difference Learning

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Temporal Difference Learning Temporal difference learning, TD prediction, Q-learning, elibigility traces. (many slides from Marc Toussaint) Vien Ngo MLR, University of Stuttgart

More information

Seminar in Artificial Intelligence Near-Bayesian Exploration in Polynomial Time

Seminar in Artificial Intelligence Near-Bayesian Exploration in Polynomial Time Seminar in Artificial Intelligence Near-Bayesian Exploration in Polynomial Time 26.11.2015 Fachbereich Informatik Knowledge Engineering Group David Fischer 1 Table of Contents Problem and Motivation Algorithm

More information

arxiv: v1 [cs.ai] 5 Nov 2017

arxiv: v1 [cs.ai] 5 Nov 2017 arxiv:1711.01569v1 [cs.ai] 5 Nov 2017 Markus Dumke Department of Statistics Ludwig-Maximilians-Universität München markus.dumke@campus.lmu.de Abstract Temporal-difference (TD) learning is an important

More information

MDP Preliminaries. Nan Jiang. February 10, 2019

MDP Preliminaries. Nan Jiang. February 10, 2019 MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process

More information

Notes on Reinforcement Learning

Notes on Reinforcement Learning 1 Introduction Notes on Reinforcement Learning Paulo Eduardo Rauber 2014 Reinforcement learning is the study of agents that act in an environment with the goal of maximizing cumulative reward signals.

More information

Lecture 1: March 7, 2018

Lecture 1: March 7, 2018 Reinforcement Learning Spring Semester, 2017/8 Lecture 1: March 7, 2018 Lecturer: Yishay Mansour Scribe: ym DISCLAIMER: Based on Learning and Planning in Dynamical Systems by Shie Mannor c, all rights

More information

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING AN INTRODUCTION Prof. Dr. Ann Nowé Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING WHAT IS IT? What is it? Learning from interaction Learning about, from, and while

More information

Reinforcement Learning. George Konidaris

Reinforcement Learning. George Konidaris Reinforcement Learning George Konidaris gdk@cs.brown.edu Fall 2017 Machine Learning Subfield of AI concerned with learning from data. Broadly, using: Experience To Improve Performance On Some Task (Tom

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Stuart Russell, UC Berkeley Stuart Russell, UC Berkeley 1 Outline Sequential decision making Dynamic programming algorithms Reinforcement learning algorithms temporal difference

More information

Reinforcement learning an introduction

Reinforcement learning an introduction Reinforcement learning an introduction Prof. Dr. Ann Nowé Computational Modeling Group AIlab ai.vub.ac.be November 2013 Reinforcement Learning What is it? Learning from interaction Learning about, from,

More information

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning CSE 473: Artificial Intelligence Reinforcement Learning Dan Weld Today s Outline Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Many slides

More information

Notes on Tabular Methods

Notes on Tabular Methods Notes on Tabular ethods Nan Jiang September 28, 208 Overview of the methods. Tabular certainty-equivalence Certainty-equivalence is a model-based RL algorithm, that is, it first estimates an DP model from

More information

CS 7180: Behavioral Modeling and Decisionmaking

CS 7180: Behavioral Modeling and Decisionmaking CS 7180: Behavioral Modeling and Decisionmaking in AI Markov Decision Processes for Complex Decisionmaking Prof. Amy Sliva October 17, 2012 Decisions are nondeterministic In many situations, behavior and

More information

1 Problem Formulation

1 Problem Formulation Book Review Self-Learning Control of Finite Markov Chains by A. S. Poznyak, K. Najim, and E. Gómez-Ramírez Review by Benjamin Van Roy This book presents a collection of work on algorithms for learning

More information

Administration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon.

Administration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon. Administration CSCI567 Machine Learning Fall 2018 Prof. Haipeng Luo U of Southern California Nov 7, 2018 HW5 is available, due on 11/18. Practice final will also be available soon. Remaining weeks: 11/14,

More information

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it

More information

CS 570: Machine Learning Seminar. Fall 2016

CS 570: Machine Learning Seminar. Fall 2016 CS 570: Machine Learning Seminar Fall 2016 Class Information Class web page: http://web.cecs.pdx.edu/~mm/mlseminar2016-2017/fall2016/ Class mailing list: cs570@cs.pdx.edu My office hours: T,Th, 2-3pm or

More information

RL 3: Reinforcement Learning

RL 3: Reinforcement Learning RL 3: Reinforcement Learning Q-Learning Michael Herrmann University of Edinburgh, School of Informatics 20/01/2015 Last time: Multi-Armed Bandits (10 Points to remember) MAB applications do exist (e.g.

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Dynamic Programming Marc Toussaint University of Stuttgart Winter 2018/19 Motivation: So far we focussed on tree search-like solvers for decision problems. There is a second important

More information

Reinforcement Learning and Deep Reinforcement Learning

Reinforcement Learning and Deep Reinforcement Learning Reinforcement Learning and Deep Reinforcement Learning Ashis Kumer Biswas, Ph.D. ashis.biswas@ucdenver.edu Deep Learning November 5, 2018 1 / 64 Outlines 1 Principles of Reinforcement Learning 2 The Q

More information

An Analysis of Model-Based Interval Estimation for Markov Decision Processes

An Analysis of Model-Based Interval Estimation for Markov Decision Processes An Analysis of Model-Based Interval Estimation for Markov Decision Processes Alexander L. Strehl, Michael L. Littman astrehl@gmail.com, mlittman@cs.rutgers.edu Computer Science Dept. Rutgers University

More information

A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time

A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time April 16, 2016 Abstract In this exposition we study the E 3 algorithm proposed by Kearns and Singh for reinforcement

More information

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018 Machine learning for image classification Lecture 14: Reinforcement learning May 9, 2018 Page 3 Outline Motivation Introduction to reinforcement learning (RL) Value function based methods (Q-learning)

More information

PAC Model-Free Reinforcement Learning

PAC Model-Free Reinforcement Learning Alexander L. Strehl strehl@cs.rutgers.edu Lihong Li lihong@cs.rutgers.edu Department of Computer Science, Rutgers University, Piscataway, NJ 08854 USA Eric Wiewiora ewiewior@cs.ucsd.edu Computer Science

More information

Markov Decision Processes and Solving Finite Problems. February 8, 2017

Markov Decision Processes and Solving Finite Problems. February 8, 2017 Markov Decision Processes and Solving Finite Problems February 8, 2017 Overview of Upcoming Lectures Feb 8: Markov decision processes, value iteration, policy iteration Feb 13: Policy gradients Feb 15:

More information

Prioritized Sweeping Converges to the Optimal Value Function

Prioritized Sweeping Converges to the Optimal Value Function Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science

More information

Lecture 2: Learning from Evaluative Feedback. or Bandit Problems

Lecture 2: Learning from Evaluative Feedback. or Bandit Problems Lecture 2: Learning from Evaluative Feedback or Bandit Problems 1 Edward L. Thorndike (1874-1949) Puzzle Box 2 Learning by Trial-and-Error Law of Effect: Of several responses to the same situation, those

More information

A Gentle Introduction to Reinforcement Learning

A Gentle Introduction to Reinforcement Learning A Gentle Introduction to Reinforcement Learning Alexander Jung 2018 1 Introduction and Motivation Consider the cleaning robot Rumba which has to clean the office room B329. In order to keep things simple,

More information

Open Theoretical Questions in Reinforcement Learning

Open Theoretical Questions in Reinforcement Learning Open Theoretical Questions in Reinforcement Learning Richard S. Sutton AT&T Labs, Florham Park, NJ 07932, USA, sutton@research.att.com, www.cs.umass.edu/~rich Reinforcement learning (RL) concerns the problem

More information

Q-Learning for Markov Decision Processes*

Q-Learning for Markov Decision Processes* McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan khoa.phan@mail.mcgill.ca Sandeep Manjanna sandeep.manjanna@mail.mcgill.ca (*Based on: Convergence of

More information

Reinforcement Learning

Reinforcement Learning CS7/CS7 Fall 005 Supervised Learning: Training examples: (x,y) Direct feedback y for each input x Sequence of decisions with eventual feedback No teacher that critiques individual actions Learn to act

More information

Reinforcement Learning. Yishay Mansour Tel-Aviv University

Reinforcement Learning. Yishay Mansour Tel-Aviv University Reinforcement Learning Yishay Mansour Tel-Aviv University 1 Reinforcement Learning: Course Information Classes: Wednesday Lecture 10-13 Yishay Mansour Recitations:14-15/15-16 Eliya Nachmani Adam Polyak

More information

An Adaptive Clustering Method for Model-free Reinforcement Learning

An Adaptive Clustering Method for Model-free Reinforcement Learning An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at

More information

Q-learning. Tambet Matiisen

Q-learning. Tambet Matiisen Q-learning Tambet Matiisen (based on chapter 11.3 of online book Artificial Intelligence, foundations of computational agents by David Poole and Alan Mackworth) Stochastic gradient descent Experience

More information

Lecture 1: Introduction, Entropy and ML estimation

Lecture 1: Introduction, Entropy and ML estimation 0-704: Information Processing and Learning Spring 202 Lecture : Introduction, Entropy and ML estimation Lecturer: Aarti Singh Scribes: Min Xu Disclaimer: These notes have not been subjected to the usual

More information

Final Exam December 12, 2017

Final Exam December 12, 2017 Introduction to Artificial Intelligence CSE 473, Autumn 2017 Dieter Fox Final Exam December 12, 2017 Directions This exam has 7 problems with 111 points shown in the table below, and you have 110 minutes

More information

Outline. CSE 573: Artificial Intelligence Autumn Agent. Partial Observability. Markov Decision Process (MDP) 10/31/2012

Outline. CSE 573: Artificial Intelligence Autumn Agent. Partial Observability. Markov Decision Process (MDP) 10/31/2012 CSE 573: Artificial Intelligence Autumn 2012 Reasoning about Uncertainty & Hidden Markov Models Daniel Weld Many slides adapted from Dan Klein, Stuart Russell, Andrew Moore & Luke Zettlemoyer 1 Outline

More information

REINFORCEMENT LEARNING

REINFORCEMENT LEARNING REINFORCEMENT LEARNING Larry Page: Where s Google going next? DeepMind's DQN playing Breakout Contents Introduction to Reinforcement Learning Deep Q-Learning INTRODUCTION TO REINFORCEMENT LEARNING Contents

More information

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation CS234: RL Emma Brunskill Winter 2018 Material builds on structure from David SIlver s Lecture 4: Model-Free

More information

Bias-Variance Error Bounds for Temporal Difference Updates

Bias-Variance Error Bounds for Temporal Difference Updates Bias-Variance Bounds for Temporal Difference Updates Michael Kearns AT&T Labs mkearns@research.att.com Satinder Singh AT&T Labs baveja@research.att.com Abstract We give the first rigorous upper bounds

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Based on [Kaelbling et al., 1996, Bertsekas, 2000] Bert Kappen Reinforcement learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error

More information

Review: TD-Learning. TD (SARSA) Learning for Q-values. Bellman Equations for Q-values. P (s, a, s )[R(s, a, s )+ Q (s, (s ))]

Review: TD-Learning. TD (SARSA) Learning for Q-values. Bellman Equations for Q-values. P (s, a, s )[R(s, a, s )+ Q (s, (s ))] Review: TD-Learning function TD-Learning(mdp) returns a policy Class #: Reinforcement Learning, II 8s S, U(s) =0 set start-state s s 0 choose action a, using -greedy policy based on U(s) U(s) U(s)+ [r

More information

Final Exam December 12, 2017

Final Exam December 12, 2017 Introduction to Artificial Intelligence CSE 473, Autumn 2017 Dieter Fox Final Exam December 12, 2017 Directions This exam has 7 problems with 111 points shown in the table below, and you have 110 minutes

More information

Lecture 4: Approximate dynamic programming

Lecture 4: Approximate dynamic programming IEOR 800: Reinforcement learning By Shipra Agrawal Lecture 4: Approximate dynamic programming Deep Q Networks discussed in the last lecture are an instance of approximate dynamic programming. These are

More information

Notes from Week 9: Multi-Armed Bandit Problems II. 1 Information-theoretic lower bounds for multiarmed

Notes from Week 9: Multi-Armed Bandit Problems II. 1 Information-theoretic lower bounds for multiarmed CS 683 Learning, Games, and Electronic Markets Spring 007 Notes from Week 9: Multi-Armed Bandit Problems II Instructor: Robert Kleinberg 6-30 Mar 007 1 Information-theoretic lower bounds for multiarmed

More information

Efficient Learning in Linearly Solvable MDP Models

Efficient Learning in Linearly Solvable MDP Models Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence Efficient Learning in Linearly Solvable MDP Models Ang Li Department of Computer Science, University of Minnesota

More information

Sequential decision making under uncertainty. Department of Computer Science, Czech Technical University in Prague

Sequential decision making under uncertainty. Department of Computer Science, Czech Technical University in Prague Sequential decision making under uncertainty Jiří Kléma Department of Computer Science, Czech Technical University in Prague https://cw.fel.cvut.cz/wiki/courses/b4b36zui/prednasky pagenda Previous lecture:

More information

The Reinforcement Learning Problem

The Reinforcement Learning Problem The Reinforcement Learning Problem Slides based on the book Reinforcement Learning by Sutton and Barto Formalizing Reinforcement Learning Formally, the agent and environment interact at each of a sequence

More information

An Introduction to Reinforcement Learning

An Introduction to Reinforcement Learning An Introduction to Reinforcement Learning Shivaram Kalyanakrishnan shivaram@cse.iitb.ac.in Department of Computer Science and Engineering Indian Institute of Technology Bombay April 2018 What is Reinforcement

More information

15-780: ReinforcementLearning

15-780: ReinforcementLearning 15-780: ReinforcementLearning J. Zico Kolter March 2, 2016 1 Outline Challenge of RL Model-based methods Model-free methods Exploration and exploitation 2 Outline Challenge of RL Model-based methods Model-free

More information

Lecture 8: Policy Gradient

Lecture 8: Policy Gradient Lecture 8: Policy Gradient Hado van Hasselt Outline 1 Introduction 2 Finite Difference Policy Gradient 3 Monte-Carlo Policy Gradient 4 Actor-Critic Policy Gradient Introduction Vapnik s rule Never solve

More information

Bellmanian Bandit Network

Bellmanian Bandit Network Bellmanian Bandit Network Antoine Bureau TAO, LRI - INRIA Univ. Paris-Sud bldg 50, Rue Noetzlin, 91190 Gif-sur-Yvette, France antoine.bureau@lri.fr Michèle Sebag TAO, LRI - CNRS Univ. Paris-Sud bldg 50,

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Function approximation Mario Martin CS-UPC May 18, 2018 Mario Martin (CS-UPC) Reinforcement Learning May 18, 2018 / 65 Recap Algorithms: MonteCarlo methods for Policy Evaluation

More information

3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H.

3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H. Appendix A Information Theory A.1 Entropy Shannon (Shanon, 1948) developed the concept of entropy to measure the uncertainty of a discrete random variable. Suppose X is a discrete random variable that

More information

CSEP 573: Artificial Intelligence

CSEP 573: Artificial Intelligence CSEP 573: Artificial Intelligence Hidden Markov Models Luke Zettlemoyer Many slides over the course adapted from either Dan Klein, Stuart Russell, Andrew Moore, Ali Farhadi, or Dan Weld 1 Outline Probabilistic

More information

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes Today s s Lecture Lecture 20: Learning -4 Review of Neural Networks Markov-Decision Processes Victor Lesser CMPSCI 683 Fall 2004 Reinforcement learning 2 Back-propagation Applicability of Neural Networks

More information

CSE250A Fall 12: Discussion Week 9

CSE250A Fall 12: Discussion Week 9 CSE250A Fall 12: Discussion Week 9 Aditya Menon (akmenon@ucsd.edu) December 4, 2012 1 Schedule for today Recap of Markov Decision Processes. Examples: slot machines and maze traversal. Planning and learning.

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Lecture 3: RL problems, sample complexity and regret Alexandre Proutiere, Sadegh Talebi, Jungseul Ok KTH, The Royal Institute of Technology Objectives of this lecture Introduce the

More information

ARTIFICIAL INTELLIGENCE. Reinforcement learning

ARTIFICIAL INTELLIGENCE. Reinforcement learning INFOB2KI 2018-2019 Utrecht University The Netherlands ARTIFICIAL INTELLIGENCE Reinforcement learning Lecturer: Silja Renooij These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

More information