Sequential Decision Problems

Michael A. Goodrich
November 10, 2006

If I make changes to these notes after they are posted, and if these changes are important (beyond cosmetic), the changes will be highlighted in bold red font. This will allow you to quickly compare this version of the notes to an old version.

1 Introduction

To this point in the class, we have studied stateless games in which the payoff matrix is known. We have explored these games both as stage (single-shot) games and as repeated games. In this portion of the class, we will continue to study repeated games, but we will assume as little about the problems as possible, and we will look at a class of algorithms that allow us to represent states. To be precise, we will assume that we don't know the game matrix, we don't know anything about the strategy of the other agent, and we don't know what strategy we should apply to do well in the repeated game. Within these constraints, we will explore a class of reinforcement learning algorithms that attempt to learn various parameters of the repeated game. The purpose of this tutorial is to introduce you to the concepts necessary to understand reinforcement learning algorithms.

2 Sequential Estimation

To begin this tutorial, we will study a technique for sequentially estimating the mean of a random process. To this end, suppose that we have a random variable $X$. Recall that a random variable is a special type of function (a measurable function, to be precise) that maps the set of theoretically distinguishable events in the world to the set of real numbers. The intuition behind a random variable is that the world can take on a number of distinguishable states that can be described using the axioms of probability. Though distinguishable, not all of these events can be observed directly. Rather, we have to invoke a measurement process that inherits the uncertainty of the state of nature. The random variable is a function that represents the measurement process: it maps the set of distinguishable events to a real number in a lossy manner. If you don't really understand this, don't worry too much. The important thing to keep in mind is that a random variable assigns a real number to some event from the state of nature.

If the mean of this random variable needs to be known, there are several ways to estimate it. Let $P_X(x)$ denote the distribution of this random variable, and let $\mu$ denote the mean of this distribution. Suppose that we want to create an estimate, $m$, of the mean of this distribution given a sequence of observations, $x_1, x_2, \ldots, x_n$, drawn according to $P_X(x)$.

We can compute this estimate of the mean from $n$ samples using the following:
$$m(n) = \frac{1}{n} \sum_{k=1}^{n} x_k.$$
Unfortunately, with this form, if we want to compute the mean when we have $n+1$ samples then we have to redo the sum. It would be better to come up with an iterative method where we express $m(n+1)$ in terms of $m(n)$:
$$m(n+1) = \frac{1}{n+1} \sum_{k=1}^{n+1} x_k = \frac{1}{n+1} \left[ x_{n+1} + \sum_{k=1}^{n} x_k \right] = \frac{1}{n+1} x_{n+1} + \frac{n}{n+1} m(n).$$
We can rewrite this iterative method for computing $m(n+1)$ as a convex sum by setting $\alpha_{n+1} = \frac{1}{n+1}$ as follows:
$$m(n+1) = \alpha_{n+1} x_{n+1} + (1 - \alpha_{n+1}) m(n). \quad (1)$$
This equation is interesting because it means that we can iteratively adjust our estimate of the mean by a weighted combination of the old estimate with a new observation.

[Figure 1: Computing the mean using sequential estimation.]

The sequential estimate for a random variable $X \sim N(\mu, \sigma)$ with $\mu = 10$ and two values of $\sigma$ is shown in Figure 1. Notice how it takes longer for the mean to converge for the random variable with the larger standard deviation. However, for both of these random variables, the sample mean will converge to the true mean as $n \to \infty$.

It is interesting to note that this property holds not only for $\alpha_n = \frac{1}{n}$, but also for a very large class of $\alpha_n$'s, provided that the values of $\alpha_n$ get small fast, but not too fast.
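The following is a minimal sketch (not part of the original notes) of the sequential estimate in Equation (1). With $\alpha_n = 1/n$ the update reproduces the running sample mean exactly, so the estimate should settle near the true mean of the sampled distribution.

```python
import random

def sequential_mean(samples):
    """Estimate the mean of a stream of samples using Equation (1):
    m(n+1) = alpha * x_{n+1} + (1 - alpha) * m(n)."""
    m = 0.0
    for n, x in enumerate(samples, start=1):
        alpha = 1.0 / n          # alpha_n = 1/n recovers the exact sample mean
        m = alpha * x + (1.0 - alpha) * m
    return m

# Samples drawn from N(mu=10, sigma=2); the estimate approaches 10.
samples = [random.gauss(10.0, 2.0) for _ in range(10000)]
print(sequential_mean(samples))
```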

More precisely, given $m(n)$ from Equation (1), the following theorem holds (which we will state, but not prove).

Theorem. If the following hold:
$$\lim_{n \to \infty} \sum_{k=1}^{n} \alpha_k = \infty \quad (2)$$
$$\lim_{n \to \infty} \sum_{k=1}^{n} (\alpha_k)^2 \le C < \infty \quad (3)$$
then $\lim_{n \to \infty} m(n) = \mu$.

This theorem formally defines "fast, but not too fast" as constraints on how the sum of the $\alpha_n$ grows. Equation (3) says that $\alpha_n$ eventually gets small enough that our estimate stops changing, and Equation (2) says that $\alpha_n$ does not get small so quickly that $m(n)$ stops changing before it reaches the true mean. For example, $\alpha_n = \frac{1}{n}$ satisfies both conditions, since the harmonic series diverges while $\sum_n \frac{1}{n^2}$ converges. We will use a variation of Equation (1) to estimate the quality of a state-action pair in a sequential decision problem. We now turn attention to how we can define this quality.

3 Quality of a State-Action Pair

To continue with this tutorial, restrict attention to a problem with only a single agent. Consider a world in which there are multiple states, $S$, but the number of states is finite, $|S| < \infty$. Suppose that from each state we have a finite set of actions that we can take, $A$, that lead us to a new state, but that the relationship between the current state and the next state is uncertain. We will model this uncertain relationship using a first-order Markov process (please ask about what this is in class if you don't know), governed by the probability mass function $p(s' \mid s, a)$. This probability mass function represents the likelihood of reaching a new state, $s'$, from the existing state, $s$, when we play an action $a$. We call this probability mass function the transition probability, meaning the probability that we transition from one state to the next when we take a particular action.

Now, suppose that there are good things and bad things that we can do in the world, and that these good things and bad things lead to numerical rewards and penalties. Let these rewards and penalties be denoted by the random variable $R(s, a)$, and suppose that these reinforcers are also random. These reinforcers occur when we are in a particular state $s$ and take an action $a$ within this state.

Our goal is to choose a good action no matter what state we find ourselves in. To this end, suppose that someone gives us a function that tells us what action to take in a given state. Denote this function by $\pi : S \to A$ and call it our policy. How well does this particular policy work? To answer this question, we can pretend that we have a function that tells us the expected payoff for using this policy as a function of the state of the world. We will just hypothesize that such a function exists, make some assumptions about how utilities combine, and then we will start to study its properties. Denote this function $V(s, \pi)$. This value function should accumulate both the immediate reward for making the choice $\pi(s)$ in the current state, plus all of the future rewards that might accumulate. Since the world is probabilistic, we might spend forever jumping from one state to another. Consequently, we cannot just sum up rewards.

Instead, we will do what we did when we wanted to know how well a particular strategy works in an indefinitely repeated sequence of matrix games: we can discount future rewards. Let $N(s)$ denote all of the states in the neighborhood of state $s$, meaning all of those states that can be reached with nonzero probability under the transition probability $p(s' \mid s, a) = p(s' \mid s, \pi(s))$. The expected value of the policy in the given state is then
$$V(s, \pi) = \mu_R(s, \pi(s)) + \gamma \sum_{s' \in N(s)} p(s' \mid s, \pi(s)) \left[ \mu_R(s', \pi(s')) + \gamma \sum_{s'' \in N(s')} p(s'' \mid s', \pi(s')) \left[ \mu_R(s'', \pi(s'')) + \cdots \right] \right], \quad (4)$$
where $\mu_R(s, \pi(s))$ is the mean of the random variable that describes the reinforcer when action $a = \pi(s)$ is chosen. This looks intractable because we have to average over the rewards for the next state plus the average rewards reachable from that state, and so on. Fortunately, there is a pattern that we can exploit to write a recursive definition of the function. The key to seeing this pattern is to note that the value of $V(s, \pi)$ is the expected reinforcer plus the discounted average over all expected future rewards achievable from any member of the neighborhood. What is the expected future reward achievable from any member of the neighborhood, $s' \in N(s)$? It is $V(s', \pi)$. This means that Equation (4) can be rewritten as
$$V(s, \pi) = \mu_R(s, \pi(s)) + \gamma \sum_{s' \in N(s)} p(s' \mid s, \pi(s)) V(s', \pi). \quad (5)$$
Thus, we have a recursive definition of the value of using a policy $\pi : S \to A$ from state $s$.

There are well-known algorithms for taking this recursive definition and solving for $V(s, \pi)$. One technique involves rewriting Equation (5) as a system of linear equations in the variables $V(s, \pi)$ for a fixed $\pi$ and then solving that system. This works well unless there are a lot of states; when there are a lot of states, solving the system can take a really long time (unless the neighborhoods of all of the states are small, in which case we can use techniques for solving matrix equations with sparse matrices). A technique that works when there are a lot of states is an iterative algorithm called value iteration. If you are interested in learning about this algorithm, let me know and I can point you to a reference.

Although these algorithms are cool, they do not really help us get to where we are trying to go. Recall that our goal was to find the best policy given the random reinforcer and transition probability for a problem. Equation (5) only tells us the value of a policy in a given state. We could use this idea to search through all possible policies. This is theoretically computable since there are a finite number of states and actions. However, it is not practically computable: for each state there are $|A|$ actions, for each pair of states there are $|A|^2$ combinations of actions, and for the full set of $|S|$ states there are $|A|^{|S|}$ policies. So, we will have to fiddle with Equation (5) some more.

To this end, pretend that I knew the best possible policy. Denote this best possible policy by $\pi^*$. If you knew this policy, then you could solve Equation (5) for $V(s, \pi^*)$ for all of the states. We can exploit the existence of this optimal policy to determine the quality of any possible action within a given state. Let $Q(s, a)$ denote the expected utility of choosing action $a$ in state $s$ assuming that I use the optimal policy thereafter. We can modify Equation (5) to help us evaluate $Q(s, a)$ for all state-action pairs (while we still pretend that we know the optimal policy $\pi^*$). This gives
$$Q(s, a) = \mu_R(s, a) + \gamma \sum_{s' \in N(s)} p(s' \mid s, a) V(s', \pi^*). \quad (6)$$
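As a concrete illustration of solving Equation (5) as a linear system and then applying Equation (6), here is a minimal sketch. The three-state, two-action model, its transition probabilities, and its mean rewards are all made up for illustration, and the policy being evaluated is an arbitrary fixed one rather than $\pi^*$.

```python
import numpy as np

# Toy model, made up for illustration: P[a][s][s'] = p(s'|s,a), mu_R[s][a].
P = np.array([[[0.9, 0.1, 0.0],
               [0.1, 0.8, 0.1],
               [0.0, 0.1, 0.9]],
              [[0.5, 0.5, 0.0],
               [0.0, 0.5, 0.5],
               [0.5, 0.0, 0.5]]])
mu_R = np.array([[0.0, 0.1],
                 [0.0, 0.2],
                 [1.0, 0.5]])
gamma = 0.9

def evaluate_policy(pi):
    """Solve Equation (5) as the linear system V = r_pi + gamma * P_pi * V."""
    n = mu_R.shape[0]
    P_pi = np.array([P[pi[s], s] for s in range(n)])   # row s is p(. | s, pi(s))
    r_pi = np.array([mu_R[s, pi[s]] for s in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def q_from_v(V):
    """Equation (6): Q(s,a) = mu_R(s,a) + gamma * sum_s' p(s'|s,a) V(s')."""
    return mu_R + gamma * np.einsum('ast,t->sa', P, V)

pi = np.array([0, 0, 0])       # some fixed policy to evaluate
V = evaluate_policy(pi)
print(V)
print(q_from_v(V))
```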

Now comes a very clever observation. If we knew $Q(s, a)$, then we could find the optimal action in that state as $a^* = \arg\max_{a \in A} Q(s, a)$. Since the optimal policy produces the optimal action in a given state, we know that $\pi^*(s) = a^* = \arg\max_{a \in A} Q(s, a)$, which means that $V(s, \pi^*) = \max_{a \in A} Q(s, a)$. Plugging this into Equation (6) yields
$$Q(s, a) = \mu_R(s, a) + \gamma \sum_{s' \in N(s)} p(s' \mid s, a) \max_{a' \in A} Q(s', a'). \quad (7)$$
We now have a recursion relation on $Q(s, a)$ which can be solved using the value iteration algorithm. When this algorithm is done running, we have quality estimates (called Q-values) for each state-action pair when we choose optimally for every subsequent choice.

4 Q-Learning

The problem with Equation (7) is that it requires us to know the transition probabilities. In keeping with our goal to use as little knowledge as possible, we should try to find an algorithm that lets us determine the Q-values without knowing the transition probabilities. In this section, we will discuss this algorithm. The algorithm, known as Q-learning, will be the basis for most of the multi-agent reinforcement learning algorithms that we will study this semester.

The Q-learning algorithm uses a sequential estimate similar to the one used in Section 2. The algorithm is as follows:
$$Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha \left[ r(s, a) + \gamma \max_{a \in A} Q(s', a) \right]. \quad (8)$$
Since there is a lot going on in this equation, we will step through each of the parts. To understand the basic structure of the equation, consider a state-free and simplified version of Equation (8),
$$Q(a) \leftarrow (1 - \alpha) Q(a) + \alpha\, r. \quad (9)$$
Equation (9) does not include any dependence on $s$, nor does it have the $\gamma \max_{a \in A} Q(s', a)$ portion of Equation (8). In this form, it is easy to see that the Q-learning equation is based on a sequential estimate of an expected value, similar in form to Equation (1) with $Q(a)$ replacing $m(n)$ and $r$ replacing $x_n$. To make this even more clear, we can use index notation and write Equation (9) as
$$Q(a; n) = (1 - \alpha_{n-1}) Q(a; n-1) + \alpha_{n-1}\, r(a; n-1). \quad (10)$$
In this form, it is hopefully clear to you that $r(a; n-1)$ is simply the $(n-1)$st sample from the reinforcement random variable $R$, and that the $n$th estimate of the Q-value is just the sequential combination of this sample and the old estimate of the Q-value. Note that the Q-value is a function of the action that we are considering (each action produces a different reward), so we actually have $|A|$ of these sequential estimates occurring, but everything else in Equation (10) is analogous to Equation (1). Thus, the essence of Q-learning is creating a sequential estimate of the quality of performing a particular action. As such, we must select values of $\alpha_n$ that get small fast, but not too fast, which means that they satisfy Equations (2) and (3).
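To make the "$|A|$ sequential estimates" point concrete, here is a minimal sketch (not in the original notes) of the state-free estimate in Equation (10). The two actions and their mean rewards are made up; one sequential estimate is maintained per action, and each $Q(a)$ should approach $\mu_R(a)$.

```python
import random

# Mean reward for each action (made up for illustration).
MEAN_REWARD = {"left": 1.0, "right": 3.0}

def estimate_q_stateless(num_samples=20000):
    """Equation (10): one sequential estimate per action, so Q(a) -> mu_R(a)."""
    q = {a: 0.0 for a in MEAN_REWARD}
    counts = {a: 0 for a in MEAN_REWARD}
    for _ in range(num_samples):
        a = random.choice(list(MEAN_REWARD))       # try every action often
        r = random.gauss(MEAN_REWARD[a], 1.0)      # sample from R(a)
        counts[a] += 1
        alpha = 1.0 / counts[a]                    # satisfies Equations (2) and (3)
        q[a] = (1.0 - alpha) * q[a] + alpha * r
    return q

print(estimate_q_stateless())   # roughly {'left': 1.0, 'right': 3.0}
```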

The next step in understanding the Q-learning equation is to recall that we are estimating the value of a given action in a repeated-play context. Thus, we need to take into consideration the expected value of the next action that we will choose. Consider the following equation:
$$Q(a; n) = (1 - \alpha_{n-1}) Q(a; n-1) + \alpha_{n-1} \left[ r(a; n-1) + \gamma Q(a; n-1) \right]. \quad (11)$$
This is identical to Equation (10) except for the presence of the term $\gamma Q(a; n-1)$, which is added to the sample from the reinforcer random variable. In effect, this extra term says that we are now including expected future reward, discounted by $\gamma$. For this equation, the expected discounted reward for continuing to play action $a$ indefinitely into the future is not known for sure, but we do have an estimate of it in the form of $Q(a; n-1)$. If Equation (11) works correctly, then we would hope to see
$$\lim_{n \to \infty} Q(a; n) = \frac{\mu_R(a)}{1 - \gamma},$$
meaning that our estimates of the Q-values converge to the expected discounted reward received for playing action $a$ for all time. Indeed, Equation (11) does cause this limit to hold, but we will postpone giving the reference that proves this claim until we further describe the Q-learning equation.

The final step in interpreting the Q-learning equation is to reintroduce the state variable. Introducing state naively into Equation (11) gives
$$Q(s, a; n) = (1 - \alpha_{n-1}) Q(s, a; n-1) + \alpha_{n-1} \left[ r(s, a; n-1) + \gamma Q(?, a; n-1) \right]. \quad (12)$$
Hopefully, you can see that introducing state does not change very much. The $n$th estimate of the Q-value for taking action $a$ in state $s$ is equal to the sequential estimate obtained by combining the old value of our estimate with a sample of the reinforcer received when we take action $a$ in state $s$, modified by the term that says that we are doing this in a repeated-play context. The problem is that when we take an action, we do not know for sure which next state will occur, because the transition from one state to the next given an action is described by a Markov process. Fortunately, we can return to the basics of using samples from a random process to estimate its mean with a sequential estimate. Similar to the way that $r(s, a; n-1)$ is the $(n-1)$st sample obtained from the random variable $R(s, a)$, we can obtain a sample of the discounted future reward by simply observing which of all of the possible next states in the neighborhood of $s$, $s' \in N(s)$, occurs. We will call the state that occurs $s'$. Thus, we can replace the question mark in $\gamma Q(?, a; n-1)$ by $s'$, yielding $\gamma Q(s', a; n-1)$. Thus, introducing state into Equation (11) gives
$$Q(s, a; n) = (1 - \alpha_{n-1}) Q(s, a; n-1) + \alpha_{n-1} \left[ r(s, a; n-1) + \gamma Q(s', a; n-1) \right]. \quad (13)$$
This equation says that our estimate of the quality of a state-action pair is obtained as the sequential combination of our old estimate plus a new sample of the sum of our instantaneous reward, $r(s, a; n-1)$, and the discounted sample of future expected reward, $Q(s', a; n-1)$.

There is just one problem with this equation, and it stems from the fact that we are assuming that action $a$ is used in state $s'$. You can see this by noting that the action $a$ in $Q(s', a; n-1)$ is the same action as the one used to obtain the reward, $r(s, a; n-1)$.

Since Q-values are supposed to be evaluations of the expected reward of choosing a particular action in a given state and then choosing optimally thereafter, this is a problem. Fortunately, we have an estimate of the future expected reward for choosing optimally thereafter in the form of $\max_{a \in A} Q(s', a; n-1)$. Notice that the $a$ in this expression is a dummy variable and should not be confused with the $a$ in $r(s, a; n-1)$ or $Q(s, a; n-1)$; we could just as well have written $\max_{b \in A} Q(s', b; n-1)$. Thus, Equation (13) should really be
$$Q(s, a; n) = (1 - \alpha_{n-1}) Q(s, a; n-1) + \alpha_{n-1} \left[ r(s, a; n-1) + \gamma \max_{a \in A} Q(s', a; n-1) \right]. \quad (14)$$
When we drop the dependence on $n$ (but assume that it is there, so that we can make sure that $\alpha$ gets small fast but not too fast), we obtain Equation (8).

5 Implementing Q-Learning

Watkins [1] proved that using Equation (8) causes our estimates of the Q-values to converge to the true Q-values provided that the following conditions hold:

- The $\alpha$ values get small fast but not too fast (Equations (2) and (3)).
- We run the program infinitely long ($n \to \infty$).
- We make sure to try each action in each state infinitely often (so that we do not always try the same action every time we reach, for example, some state $s$).

These conditions seem nice from a theoretical perspective, but they seem to make it impossible to use Q-learning in practice. Fortunately, we can still get very good estimates of the true Q-values using the Q-learning equation, but it takes a little parameter tuning. In the lab, you will be given a set of parameters that work pretty well, and you will then play around with these parameters to see how they affect what is learned. In the remainder of this section, we will discuss some of the things that we can play around with and how they affect the behavior of the algorithm.

5.1 Starting and Stopping in a Path-Finding Problem

In the upcoming lab, you will experiment with using Q-learning to solve a path-finding problem. The task will be to learn a table of Q-values (a table, since you need to find values for all possible states and all possible actions from within those states). One way that you will measure the performance of your Q-values will be by computing the average path length from a given starting point to the goal location. Penalties ($r(s, a) < 0$) will be assessed when you run into walls, but payoffs ($r(s, a) > 0$) will only be delivered when you reach the goal.

The first question to answer is how to compute the Q-values for the goal state. Let $s_G$ denote the goal state, so the question that we are trying to answer is how to determine $Q(s_G, a)$ for all values of $a$. The easiest way to do this is to set rewards to zero for all values of $s$ and $a$, except for when $s = s_G$. When $s = s_G$, set the rewards to $r(s_G, a) = c$ for some constant $c$ for all actions. Under these conditions, the Q-values for this state will eventually become $\frac{c}{1 - \gamma}$ for all actions.
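To see where this value comes from (a short check, not part of the original notes, and assuming the goal state is treated as absorbing so that the bootstrap term is the goal state's own value), the fixed point of the update at the goal satisfies
$$Q(s_G, a) = c + \gamma \max_{a' \in A} Q(s_G, a') \;\Longrightarrow\; Q^* = c + \gamma Q^* \;\Longrightarrow\; Q^* = c + \gamma c + \gamma^2 c + \cdots = \frac{c}{1 - \gamma}.$$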

Since this can be computed offline, you can just set these values before the algorithm starts to run. Then, when the goal state is reached during a learning episode, you can conclude the episode and begin a new episode at some starting location.

This leads to the question of how to choose states from which to learn. One way to learn is to always begin in the starting location and then explore (using techniques outlined in the next section) until the goal is reached. The problem with this approach is that, because transitions from one state to another are only probabilistically determined by the chosen action, learning only from the given starting state will cause the state space to be unevenly sampled; states that are near the direct path will be explored a lot because they will be reached with high probability, and states that are far from the direct path will rarely be observed. Although this is not a problem from a theoretical perspective (because the algorithm will run an infinite number of times), it is a problem when we run the algorithm a limited number of times. One way to sample the state space more uniformly is to start at a random location in the state space, learn using one of the exploration methods described below, and then conclude learning when the goal state is reached. Randomly restarting will tend to produce more uniform coverage of the state space and will make the learned values approach the true values more quickly, which means that we have a hope that the Q-values will be useful even if we do not let the algorithm run forever.

5.2 Exploration versus Exploitation

One of the easiest ways to ensure that every action is taken from every state infinitely often is to randomly choose from the set of possible actions available to the agent in a given state. This ensures that there is enough coverage of the possible actions to lead the Q-learning algorithm to convergence. In a path-finding problem, we could theoretically randomly choose a state, randomly choose an action from that state, observe the reinforcement and state that result from choosing that action in that state, update the Q-values, and then repeat the entire process. This process never actually finds a path from a starting state to the goal, but if we wait long enough and have our $\alpha$ values decay properly, then the estimated Q-values will eventually approach the true Q-values.

When we shift back to reality, we cannot wait forever for convergence. Instead, we want to try to determine whether the estimated Q-values are getting close to the true Q-values, or at least close enough that we are solving the problem we set out to solve. In a path-finding problem, we do this by (a) checking to see if the path is getting more efficient and (b) checking to see if we are having wild swings in the Q-values as things progress. To do this, we need to try to choose actions that bias the learning toward those actions that are likely to be closer to the optimal actions. Although we need to try to do this, we do not want to go so far that we forget to explore around a bit. In other words, we want to find a balance between exploiting what we have learned (and thus biasing our learning toward actions that are more likely to be successful) and exploring (and thus ensuring that sufficient exploration takes place to prevent premature convergence). In the lab, you will experiment with several different approaches to finding this tradeoff.
At one extreme, you will always exploit what has been learned and depend entirely on the randomness in the world (and possibly random restarts) to explore. At the other extreme, you will randomly select actions and rely on the randomness in the world to lead you to the goal.
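One common middle ground between these two extremes is $\epsilon$-greedy selection: exploit the current Q-values most of the time, but choose a random action with small probability $\epsilon$. The sketch below is not taken from the lab handout; the environment interface (`reset()`, `step()`), the constant learning rate, and all parameter values are made-up placeholders.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon explore uniformly; otherwise exploit current Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning (Equation (8)) with epsilon-greedy exploration.

    `env` is assumed (hypothetically) to provide reset() -> state and
    step(state, action) -> (next_state, reward, done), standing in for the
    lab's path-finding world. A constant alpha is used here for brevity;
    a decaying schedule (Section 5.3) is preferable in practice.
    """
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()                  # random restarts help cover the state space
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions, epsilon)
            s_next, r, done = env.step(s, a)
            target = r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s_next
    return Q
```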

5.3 Decaying α

In theory, any $\alpha$ schedule that decays fast but not too fast will lead to convergence. In practice, we have to be more selective. For example, we have already seen that $\alpha_n = \frac{1}{n}$ satisfies the conditions of Equations (2) and (3), but this is not a very good idea in practice because the values of $\alpha_n$ get small too early in the learning, which means that changes in Q-values in one region of the state space propagate to other regions of the state space very slowly. In the lab, we take the max of two different decay schedules for $\alpha$: one that stays high for a long time and then decays to zero exponentially fast, and another that decays linearly with $n$. Another approach is to use a different $\alpha$ for each state, $\alpha(s; n)$, and then decay this state-dependent $\alpha$ only when the state is visited.

5.4 Convergence Issues

Since we cannot explore forever, we want to decide when to quit. The approach suggested in the lab is to measure how much the Q-values change as learning progresses. When this change gets really small, we can safely guess that most of the learning has stopped. At the very least, we need to evaluate how much things have changed for all possible Q-values, so we test whether
$$\sum_{s \in S} \sum_{a \in A} \left| Q(s, a; n) - Q(s, a; n-1) \right| < \epsilon.$$
When you do this, watch out for two commonly made mistakes (based on the 2004 lab write-ups). First, since the $\alpha$ values decay, it is possible to find that the Q-values have not changed very much simply because the value of $\alpha$ is so small. Figure out a way to test for convergence that avoids this problem. Second, you probably do not want to test convergence using a random restart that produces a very short path (e.g., starting right next to the goal and exploiting current Q-value estimates); such an approach will almost always produce small changes from one iteration to the next simply because so few Q-values could possibly be changed. Figure out a way to test for convergence that avoids this problem too.

References

[1] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279-292, 1992.
