A Review of the E^3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time

April 16, 2016

Abstract

In this exposition we study the E^3 algorithm proposed by Kearns and Singh for reinforcement learning, which outputs a policy that is PAC-MDP (Probably Approximately Correct in Markov Decision Processes), and we discuss its merits and drawbacks.

1 Introduction

Reinforcement learning is a method by which an agent learns how to act in order to maximize its cumulative reward. A classical example is a cart with a swinging pole that learns how to adjust its velocity so that the pole balances upright for as long as possible. In this case, the cart controller is the agent, the velocity is the action, and the duration and accuracy of pole-balancing is the reward. At the beginning of the process, the cart controller has little information about how much reward will be given if the cart moves forward at a certain speed. Over some number of interactions with the unknown environment, the agent must learn which actions in which situations maximize the payoff. This process is typically modeled by a Markov decision process (MDP), under the assumption that the agent has no knowledge of the parameters of the process.

In the cart-pole example above, the cart controller faces the exploration-exploitation trade-off: at any given time, should the cart exploit its current knowledge of the environment to maximize the payoff, staying oblivious to actions with potentially better outcomes, or should it explore by trying different movements in the hope of finding a policy with higher payoff, at the expense of computation time?

In this exposition, we discuss how Kearns and Singh address this fundamental trade-off with the E^3 algorithm, which is PAC-MDP (Probably Approximately Correct in Markov Decision Processes) as first defined in Strehl et al. [2006]: with high probability the algorithm outputs a near-optimal policy in time polynomial in the size of the system and in the approximation parameters.

This was the first algorithm to explicitly address the explore-exploit dilemma, and it inspired Brafman and Tennenholtz to give a generalized algorithm (the R-max algorithm) that can also be used in adversarial settings [Brafman and Tennenholtz, 2003].

Unfortunately, the E^3 algorithm may not be ideal for reinforcement learning in the real world. First, Kearns and Singh assume that the agent can observe the state of the environment. The second caveat stems from the fact that the E^3 algorithm is model-based: it builds a model of the world through exploration of the environment. This approach is useful when there is a manageable number of states, but it may be infeasible if the state space is large. Although the E^3 algorithm only maintains a partial model, even this partial model may be large if, for instance, the state space is continuous. Algorithms such as Q-learning are model-free, maintaining only a value function for each state-action pair, but their convergence is asymptotic, requiring infinite exploration to reach the optimum.

2 Preliminaries

In this section we give definitions of crucial concepts used in Kearns and Singh [2002] as well as in reinforcement learning more generally.

Definition 1. A Markov decision process (MDP) M on the set of states {1, 2, ..., N} with actions a_1, ..., a_k is defined by the transition probabilities P^a_M(ij), the probability that taking action a at state i leads to state j, for all states i, j and actions a, and by the payoff distributions for each state i, with mean R_M(i) satisfying 0 ≤ R_M(i) ≤ R_max and variance Var_M(i) ≤ Var_max. Thus, for every state-action pair (i, a),

    \sum_{j=1}^{N} P^{a}_{M}(ij) = 1.

The upper bounds on the mean and the variance of the rewards are necessary for finite-time convergence. The non-negativity assumption on the rewards can be removed by adding a large constant to every reward.

Definition 2. A policy in an MDP M is a mapping π from the set of states to the set of actions.

Hence, a policy π induces a stationary distribution on the states, since for every state there is only one action given by π. For the purposes of this exposition we only deal with unichain MDPs, which are MDPs in which every policy has a well-defined stationary distribution. That is, the action chosen by a policy from a given state does not depend on the time of arrival at that state, and hence the stationary distribution of any policy does not depend on the starting state.

We now move on to concepts regarding finite-length paths in MDPs.
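To make Definitions 1 and 2 concrete, the following is a minimal sketch (my own, not from the original paper) of how a finite MDP of this form and a stationary policy might be represented; the array names P and R and the helper `step` are illustrative assumptions.

```python
import numpy as np

class MDP:
    """A finite MDP with N states and k actions, as in Definition 1.

    P[a, i, j] is the probability that action a taken in state i leads to
    state j; R[i] is the mean payoff of state i (bounded by R_max).
    """

    def __init__(self, P, R):
        self.P = np.asarray(P, dtype=float)   # shape (k, N, N)
        self.R = np.asarray(R, dtype=float)   # shape (N,)
        # Every state-action pair must define a probability distribution:
        # sum_j P[a, i, j] = 1 for all a, i.
        assert np.allclose(self.P.sum(axis=2), 1.0)

    @property
    def num_states(self):
        return self.P.shape[1]

    @property
    def num_actions(self):
        return self.P.shape[0]


def step(mdp, i, policy, rng):
    """One step from state i under a policy (a map state -> action).

    Returns the next state and the mean payoff of the current state; a real
    environment would return a noisy payoff with this mean.
    """
    a = policy[i]
    j = rng.choice(mdp.num_states, p=mdp.P[a, i])
    return j, mdp.R[i]
```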

Definition 3. A T-path in M is a sequence of T + 1 states (i.e., T transitions). The probability that a T-path p = i_1 ... i_{T+1} is traversed in M upon starting in state i_1 and executing policy π is

    \Pr^{\pi}_{M}[p] = \prod_{k=1}^{T} P^{\pi(i_k)}_{M}(i_k, i_{k+1}).    (1)

Definition 4. The expected undiscounted return along p in M is

    U_M(p) = \frac{1}{T} \sum_{k=1}^{T} R_{i_k}    (2)

and the expected discounted return along p in M is

    V_M(p) = \sum_{k=1}^{T} \gamma^{k-1} R_{i_k}    (3)

where γ ∈ (0, 1) is a discount factor that discounts the value of future rewards. The T-step undiscounted return from state i is

    U^{\pi}_{M}(i, T) = \sum_{p} \Pr^{\pi}_{M}[p] \, U_M(p)    (4)

where the sum is over all T-paths p in M starting at state i, and the discounted version V^π_M(i, T) is defined similarly. The optimal T-step undiscounted return is U^*_M(i, T) = max_π U^π_M(i, T), and similarly for V^*_M(i, T). Furthermore, let U^*_M(i) = lim_{T→∞} U^*_M(i, T) and V^*_M(i) = lim_{T→∞} V^*_M(i, T). Note that under the unichain assumption, in the undiscounted case the optimal return does not depend on the starting state i, so we can simply write U^*_M. We also denote by G^T_max the maximum payoff obtainable along any T-path in M.

We also use the notion of mixing time, the minimum number of steps T such that the distribution on states after T steps of executing π is within ε of the stationary distribution induced by π, under some measure of distance between distributions such as the Kullback-Leibler divergence. In fact, Kearns and Singh use a notion of mixing time defined in terms of the expected return.

Definition 5. Let M be an MDP and let π be a policy that induces a well-defined stationary distribution on the states. The ε-return mixing time of π is the smallest T such that for all T' ≥ T and any state i,

    |U^{\pi}_{M}(i, T') - U^{\pi}_{M}| \le \epsilon.    (5)
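As a concrete illustration of Definitions 3 and 4 (my own, not taken from the paper), the following sketch computes the probability of a given T-path and its undiscounted and discounted returns under a fixed policy; it assumes the MDP class from the earlier sketch and represents a path as the list of states [i_1, ..., i_{T+1}].

```python
def path_probability(mdp, policy, path):
    """Pr^pi_M[p]: probability of traversing path = [i_1, ..., i_{T+1}]
    when starting in i_1 and executing the policy (equation (1))."""
    prob = 1.0
    for i, j in zip(path[:-1], path[1:]):
        prob *= mdp.P[policy[i], i, j]
    return prob

def undiscounted_return(mdp, path):
    """U_M(p): average of the mean payoffs R_{i_1}, ..., R_{i_T} (equation (2))."""
    T = len(path) - 1
    return sum(mdp.R[i] for i in path[:-1]) / T

def discounted_return(mdp, path, gamma):
    """V_M(p): sum over k = 1..T of gamma^{k-1} R_{i_k} (equation (3))."""
    return sum(gamma**k * mdp.R[i] for k, i in enumerate(path[:-1]))
```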

In short, whereas the standard notion of mixing time gives us an approximate distribution on the states, the ε-return mixing time ensures that the expected return is close enough to the asymptotic payoff. The latter is polynomially bounded by the standard mixing time (see Lemma 1 of Kearns and Singh [2002]). The analogous concept for the discounted case is described in the following lemma.

Lemma 1. (Lemma 2 of Kearns and Singh [2002]) Let M be any MDP, and let π be any policy in M. If

    T \ge \frac{1}{1-\gamma} \log\left(\frac{R_{\max}}{\epsilon(1-\gamma)}\right)    (6)

then for any state i,

    V^{\pi}_{M}(i, T) \le V^{\pi}_{M}(i) \le V^{\pi}_{M}(i, T) + \epsilon.    (7)

We call the value of the lower bound on T given above the ε-horizon time for the discounted MDP M. The following is another convenient piece of notation used throughout the paper.

Definition 6. We denote by Π^{T,ε}_M the class of all ergodic policies in an MDP M with ε-return mixing time at most T, and by opt(Π^{T,ε}_M) the optimal expected asymptotic undiscounted return achievable by any policy in Π^{T,ε}_M.

3 The E^3 Algorithm

The main theorem of Kearns and Singh [2002] states the existence of a polynomial-time algorithm that returns a near-optimal policy for any given MDP.

Theorem 1. (Main Theorem of Kearns and Singh [2002]) Let M be an MDP over N states.

(Undiscounted case) There exists an algorithm A, taking inputs ε, δ, N, T, and opt(Π^{T,ε}_M), such that the total number of actions and computation time taken by A is polynomial in 1/ε, 1/δ, N, T, and R_max, and with probability at least 1 − δ the total actual return of A exceeds opt(Π^{T,ε}_M) − ε.

(Discounted case) There exists an algorithm A, taking inputs ε, δ, N, and V^*(i), such that the total number of actions and computation time taken by A is polynomial in 1/ε, 1/δ, N, the horizon time T = 1/(1 − γ), and R_max, and with probability at least 1 − δ, A will halt in a state i and output a policy π̂ such that V^{π̂}_M(i) ≥ V^*_M(i) − ε.

Kearns and Singh prove the above by describing the algorithm explicitly.
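Before turning to the algorithm itself, note that the ε-horizon time of Lemma 1 is a simple closed-form quantity. As a small illustration (my own, not from the paper), it can be computed directly:

```python
import math

def epsilon_horizon_time(gamma, r_max, eps):
    """Smallest integer T satisfying equation (6):
    T >= (1 / (1 - gamma)) * log(r_max / (eps * (1 - gamma)))."""
    return math.ceil((1.0 / (1.0 - gamma)) * math.log(r_max / (eps * (1.0 - gamma))))

# Example: epsilon_horizon_time(0.9, 1.0, 0.01) -> 70
```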

Initially, the algorithm performs what is called balanced wandering, described in Algorithm 1. In this process the agent takes an arbitrary action from a newly visited state, but from already visited states it takes the least-tried action, in order to build a balanced model of the actual world M.

Algorithm 1 Balanced Wandering
  if the current state i has never been visited before then
      Take an arbitrary action
  else if i has been visited fewer than m_known times then
      Take the action that has been tried the fewest times from i
      (if several actions are tied, choose among them at random)
  else
      Add i to S, since i has now been visited m_known times
  end if

While wandering, the algorithm keeps track of the empirical transition probabilities and payoff distributions for each state. Once a state has been visited a sufficient number of times, we say that the state is known, since we then have sufficiently good statistics for that state. The explore-exploit decision takes place whenever the algorithm reaches an already known state: if the optimal policy based on the current model is good enough by some measure, the algorithm exploits; otherwise, it explores other options. One of the key technical lemmas (the Explore or Exploit Lemma) guarantees that either the current model supports a policy with near-optimal payoff, or we can quickly explore into yet-unknown states. It is important to note that the exploration step can be executed at most N(m_known − 1) times, where m_known, the number of visits needed for a state to become known, is O(((N T G^T_max)/ε)^4 Var_max log(1/δ)) (shown in Kearns and Singh [2002], page 16), so we are guaranteed to find a near-optimal policy in not too many steps. In the following sections, we discuss the two major lemmas behind the main theorem and then describe the algorithm in further detail.

3.1 The Simulation Lemma

The first of the two key technical lemmas states that if our empirical estimate M̂ of the true MDP M approximates M within some small error depending on ε, then the expected return in M̂ is within ε of the true return in M. The measure of closeness of M̂ to M is given by the closeness of the transition probabilities and mean payoffs at each state; with error ±α, Kearns and Singh call M̂ an α-approximation of M.

Lemma 2. (Simulation Lemma, Kearns and Singh [2002]) Let M be an MDP over N states.

(Undiscounted case) Let M̂ be an O((ε/(N T G^T_max))^2)-approximation of M. Then for any policy π in Π^{T,ε/2}_M and for any state i,

    U^{\pi}_{M}(i, T) - \epsilon \le U^{\pi}_{\hat{M}}(i, T) \le U^{\pi}_{M}(i, T) + \epsilon.    (8)
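The following is a minimal Python sketch (my own, under the assumption that transitions and payoffs can be sampled from the environment one step at a time) of balanced wandering together with the empirical statistics it maintains; the class name, m_known, and the bookkeeping fields are illustrative, not the authors' code.

```python
import random
from collections import defaultdict

class BalancedWanderer:
    """Balanced wandering (Algorithm 1): from a not-yet-known state, try the
    least-tried action; a state becomes 'known' after m_known visits."""

    def __init__(self, actions, m_known):
        self.actions = list(actions)
        self.m_known = m_known
        self.visits = defaultdict(int)        # state -> visit count
        self.tries = defaultdict(int)         # (state, action) -> count
        self.trans = defaultdict(int)         # (state, action, next_state) -> count
        self.reward_sum = defaultdict(float)  # state -> total observed payoff
        self.known = set()                    # the set S of known states

    def choose_action(self, i):
        if self.visits[i] == 0:
            return random.choice(self.actions)   # never visited: arbitrary action
        # Otherwise take the action tried fewest times from i (ties at random).
        fewest = min(self.tries[(i, a)] for a in self.actions)
        candidates = [a for a in self.actions if self.tries[(i, a)] == fewest]
        return random.choice(candidates)

    def record(self, i, a, j, r):
        """Update empirical statistics after observing (i, a) -> j with payoff r."""
        self.visits[i] += 1
        self.tries[(i, a)] += 1
        self.trans[(i, a, j)] += 1
        self.reward_sum[i] += r
        if self.visits[i] >= self.m_known:
            self.known.add(i)                    # i has enough statistics: add to S
```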

(Discounted case) Let T ≥ (1/(1 − γ)) log(R_max/(ε(1 − γ))), and let M̂ be an O((ε/(N T G^T_max))^2)-approximation^1 of M. Then for any policy π and any state i,

    V^{\pi}_{M}(i, T) - \epsilon \le V^{\pi}_{\hat{M}}(i, T) \le V^{\pi}_{M}(i, T) + \epsilon.    (9)

Next, the authors establish the bound on the number m_known of visits to any state that guarantees the desired accuracy of the transition probability and payoff estimates. This bound is polynomial in the size of M, 1/ε, and log(1/δ), where δ is our chosen probability of failure.

3.2 The Explore or Exploit Lemma

The next lemma relies on the algorithm's partial model of the actual MDP M. Kearns and Singh introduce the MDP M_S induced on the set of known states S, defined in an intuitive way: the payoff of a state i in M_S is the payoff of i in M (except that the rewards have zero variance in M_S), and for every i, j in S and every action, the transition probabilities are as in M. In addition, M_S has an extra absorbing state s_0 representing all of the yet-unknown states; s_0 has zero payoff, and all transitions that do not stay within S are redirected to s_0.

Whereas M_S inherits the true transition probabilities and payoffs from M, the algorithm does not have access to this information. Rather, throughout balanced wandering, the algorithm builds an empirical estimate of the transition probabilities and payoffs in M_S, denoted M̂_S. The authors first establish that M̂_S probably approximately correctly simulates M_S in terms of the payoffs (Lemma 6 of Kearns and Singh [2002]), and they make a simple yet important observation: the payoff achievable in M_S, and thus approximately achievable in M̂_S, is also achievable in M.

The following lemma is a crucial step towards the main theorem. It states that at any time step in which we encounter a known state, either the current optimal policy in M̂_S yields a payoff close to the optimal payoff in M, or another policy based on M̂_S will quickly reach an unknown state.

Lemma 3. (Explore or Exploit Lemma, Kearns and Singh [2002]) Let M be an MDP, let S be any subset of the states of M, and let M_S be the induced MDP on S.^2 For any i in S, any T, and any 1 > α > 0, either there exists a policy π in M_S such that U^π_{M_S}(i, T) ≥ U^*_M(i, T) − α (respectively, V^π_{M_S}(i, T) ≥ V^*_M(i, T) − α in the discounted case), or there exists a policy π in M_S such that the probability that executing π from i for T steps ends in s_0 is at least α/G^T_max.

^1 There was a missing open parenthesis in the published work; it is corrected here.
^2 A typo from the original paper is corrected here.
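The construction of the empirical induced MDP M̂_S can be made concrete with a short sketch (my own, continuing the BalancedWanderer sketch above; the function name and counts are assumptions): transitions between known states get their empirical estimates, and all remaining probability mass is redirected to the absorbing state s_0.

```python
import numpy as np

def build_induced_mdp(wanderer, num_states):
    """Build the empirical induced MDP M-hat_S from a BalancedWanderer.

    States 0..num_states-1 are the original states; index num_states is the
    absorbing state s_0, which has zero payoff and absorbs everything that
    leaves the known set S or starts from an unknown/absorbing state.
    """
    S = wanderer.known
    k = len(wanderer.actions)
    s0 = num_states
    P = np.zeros((k, num_states + 1, num_states + 1))
    R = np.zeros(num_states + 1)

    for a_idx, a in enumerate(wanderer.actions):
        for i in range(num_states + 1):
            n = wanderer.tries[(i, a)]
            if i not in S or n == 0:
                P[a_idx, i, s0] = 1.0        # unknown states and s_0 are absorbed
                continue
            for j in range(num_states):
                if j in S:
                    P[a_idx, i, j] = wanderer.trans[(i, a, j)] / n
            # Remaining probability mass (transitions leaving S) goes to s_0.
            P[a_idx, i, s0] = max(0.0, 1.0 - P[a_idx, i, :num_states].sum())

    for i in S:
        R[i] = wanderer.reward_sum[i] / wanderer.visits[i]   # empirical mean payoff
    return P, R
```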

3.3 The Explicit Explore or Exploit (E^3) Algorithm

The E^3 algorithm is summarized in Algorithm 2. Note that balanced wandering occurs only in unknown states, since we are guaranteed to have collected good enough statistics for the known states. Let us turn our attention to what happens when the algorithm encounters a known state. By M̂'_S we denote the MDP with the same transition probabilities as M̂_S but with a high payoff for exploration (R_{M̂'_S}(s_0) = R_max) and zero payoff for the known states. The policies π̂ and π̂' are obtained by value iteration [Bertsekas and Tsitsiklis, 1989] on M̂_S and M̂'_S respectively, which takes O(N^2 T) time. The algorithm then checks whether the return from i under π̂ is at least opt(Π^{T,ε}_M) − ε/2 (respectively, V^*_M(i) − ε/2 in the discounted case). If so, the algorithm executes π̂ for the next T steps, where T is the ε/2-return mixing time (respectively, the algorithm halts and outputs π̂). Otherwise, the algorithm attempts exploration by executing π̂', which places high reward on the unknown states, for the next T steps in M. By the Explore or Exploit Lemma, π̂' will enter an unknown state with probability at least ε/(2 G^T_max).

Algorithm 2 The Explicit Explore or Exploit (E^3) Algorithm
  Initially, the set of known states S is empty.
  while learning M do
      if the current state i is not in S then
          Perform balanced wandering.
      else
          Compute optimal policies π̂ and π̂' in M̂_S and M̂'_S.
          if π̂ obtains a near-optimal payoff then
              Attempt exploitation using π̂.
          else
              Attempt exploration using π̂'.
          end if
          If at any point during these attempts we visit a state i not in S, perform balanced wandering.
      end if
  end while

Three sources of failure are present: (1) the estimate M̂_S is not a good approximation of M_S; (2) the exploration attempts take too long to render a new state known; and, only in the undiscounted case, (3) the exploitation attempts using π̂ repeatedly yield returns below opt(Π^{T,ε}_M) − ε/2. For all three scenarios, Kearns and Singh show that the failure probability can be made as small as we like in terms of δ.
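The offline computations in Algorithm 2 are ordinary finite-horizon value iteration on the two models. The sketch below (my own, not the authors' implementation) shows generic T-step value iteration on a model of the form (P, R) returned by build_induced_mdp above, and one way the exploit/explore test could be phrased; the names `choose_phase` and `target` are illustrative.

```python
import numpy as np

def finite_horizon_value_iteration(P, R, T, gamma=1.0):
    """Generic T-step value iteration on a model given by P[a, i, j] and R[i].

    Returns a greedy policy and the T-step values. With gamma = 1 the values
    are total T-step payoffs (divide by T for the average return U^pi(i, T));
    with gamma < 1 they approximate the discounted return V^pi(i, T).
    """
    num_actions, num_states, _ = P.shape
    V = np.zeros(num_states)
    policy = np.zeros(num_states, dtype=int)
    for _ in range(T):
        # Q[a, i] = R[i] + gamma * sum_j P[a, i, j] * V[j]
        Q = R[None, :] + gamma * np.einsum('aij,j->ai', P, V)
        policy = Q.argmax(axis=0)
        V = Q.max(axis=0)
    return policy, V

def choose_phase(value_of_pi_hat, i, target, eps):
    """Exploit if the estimated return of pi-hat from i is within eps/2 of the
    target (opt(Pi^{T,eps}_M), or V*_M(i) in the discounted case); else explore."""
    return 'exploit' if value_of_pi_hat[i] >= target - eps / 2.0 else 'explore'
```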

4 Discussion

The E^3 algorithm uses only a partial model M̂_S of the entire MDP M and yet yields a near-optimal policy. Moreover, it was the first algorithm to demonstrate finite bounds on the number of actions and the computation time required to achieve such a policy. However, E^3 fails to address the two greatest difficulties in applying reinforcement learning in the real world.

As Kearns and Singh mention, bounds polynomial in the number of states may be unsatisfactory if the state space is large. We must also note that all big-O bounds are in fact also polynomial in k, the size of the action space, as the authors assumed for convenience that k is a constant. Hence, the authors' concern about E^3 being infeasible for large state spaces extends to impracticality for large action spaces, or for any large system in general. Some algorithms exist for large state spaces, but they are only experimentally shown to work well [Jong and Stone, 2007; Doya, 2000]. One promising theoretical result by Kakade and Langford [2002] outlines an algorithm that is claimed to return a near-optimal policy efficiently, with approximation and computation parameters that do not explicitly depend on the size of the system, but I have not had time to study their work.

The E^3 algorithm also assumes that the agent can observe the state of the environment, which is not always the case in the real world. Hence, it would be interesting to study reinforcement learning in the context of POMDPs (Partially Observable Markov Decision Processes), as in Kimura et al. [1997]. Kearns and Singh propose a model-free approach as possible future work. Inspired by algorithms such as E^3, R-max, and Q-learning, Strehl et al. [2006] devised a new algorithm called Delayed Q-learning, which provably outputs a probably approximately optimal policy in a model-free setting in polynomial time.

References

D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice Hall, Englewood Cliffs, NJ, 1989.

R. I. Brafman and M. Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213-231, 2003.

K. Doya. Reinforcement learning in continuous time and space. Neural Computation, 12(1):219-245, 2000.

N. K. Jong and P. Stone. Model-based function approximation in reinforcement learning. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS '07), pages 95:1-95:8, New York, NY, USA, 2007. ACM.

S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the 19th International Conference on Machine Learning (ICML), 2002.

M. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209-232, 2002.

H. Kimura, K. Miyazaki, and S. Kobayashi. Reinforcement learning in POMDPs with function approximation. In Proceedings of the 14th International Conference on Machine Learning (ICML), 1997.

A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman. PAC model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning (ICML), pages 881-888. ACM, 2006.
