Equivalence Between Wasserstein and Value-Aware Loss for Model-based Reinforcement Learning

Kavosh Asadi (1), Evan Cater (1), Dipendra Misra (2), Michael L. Littman (1)

(1) Department of Computer Science, Brown University, Providence, USA. (2) Department of Computer Science, Cornell Tech, New York, USA. Correspondence to: Kavosh Asadi <kavosh@brown.edu>.

Accepted at the FAIM workshop "Prediction and Generative Modeling in Reinforcement Learning", Stockholm, Sweden, 2018. Copyright 2018 by the authors.

Abstract

Learning a generative model is a key component of model-based reinforcement learning. Though learning a good model in the tabular setting is a simple task, learning a useful model in the approximate setting is challenging. In this context, an important question is the loss function used for model learning, as varying the loss function can have a remarkable impact on the effectiveness of planning. Recently, Farahmand et al. (2017) proposed a value-aware model learning (VAML) objective that captures the structure of the value function during model learning. Using tools from Asadi et al. (2018), we show that minimizing the VAML objective is in fact equivalent to minimizing the Wasserstein metric. This equivalence improves our understanding of value-aware models, and also creates a theoretical foundation for applications of Wasserstein in model-based reinforcement learning.

1 Introduction

The model-based approach to reinforcement learning consists of learning an internal model of the environment and planning with the learned model (Sutton & Barto, 1998). The main promise of the model-based approach is data efficiency: the ability to perform policy improvements with a relatively small number of environmental interactions.

Although the model-based approach is well understood in the tabular case (Kaelbling et al., 1996; Sutton & Barto, 1998), the extension to the approximate setting is difficult. Models usually have non-zero generalization error due to limited training samples. Moreover, the model-learning problem can be unrealizable, leading to an imperfect model with irreducible error (Ross & Bagnell, 2012; Talvitie, 2014). Sometimes referred to as the compounding-error phenomenon, it has been shown that such small modeling errors can compound over multiple steps and degrade the policy learned using the model (Talvitie, 2014; Venkatraman et al., 2015; Asadi et al., 2018).

One way of addressing this problem is to learn a model that is tailored to the specific planning algorithm we intend to use: even though the model is imperfect, it is useful for the planning algorithm that is going to leverage it. To this end, Farahmand et al. (2017) proposed an objective function for model-based RL that captures the structure of the value function during model learning, so as to ensure that the model is useful for Value Iteration. Learning a model using this loss, known as the value-aware model learning (VAML) loss, empirically improved upon a model learned using the maximum-likelihood objective, thus providing a promising direction for learning useful models in the approximate setting.

More specifically, VAML minimizes the maximum Bellman error given the learned model, the MDP dynamics, and an arbitrary space of value functions. As we will show, computing the Wasserstein metric involves a similar maximization problem, but over a space of Lipschitz functions. Under certain assumptions, we prove that the value function of an MDP is Lipschitz. Therefore, minimizing the VAML objective is in fact equivalent to minimizing Wasserstein.

2 Background

2.1 MDPs

We consider the Markov decision process (MDP) setting, in which the RL problem is formulated by the tuple (S, A, R, T, γ). Here, S denotes a state space and A denotes an action set. The functions R : S × A → R and T : S × A → Pr(S) denote the reward and transition dynamics, and γ ∈ [0, 1) is the discount rate.
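To ground the notation, the following is a minimal NumPy sketch of such a tuple for a small tabular MDP; the state and action counts and the particular reward and transition values are illustrative assumptions, not quantities from the paper.

import numpy as np

# A small tabular MDP (S, A, R, T, gamma); sizes and values are arbitrary.
n_states, n_actions = 3, 2
gamma = 0.9

# R[s, a]: expected immediate reward for taking action a in state s.
R = np.array([[0.0, 1.0],
              [0.5, 0.0],
              [1.0, 0.2]])

# T[s, a, s']: probability of landing in s'; each row T[s, a, :] sums to 1.
T = np.array([[[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],
              [[0.0, 0.9, 0.1], [0.5, 0.5, 0.0]],
              [[0.3, 0.3, 0.4], [0.0, 0.0, 1.0]]])

assert np.allclose(T.sum(axis=-1), 1.0)

The sketches that follow reuse arrays of these shapes.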

2.2 Lipschitz Continuity

We make use of the notion of smoothness of a function, as quantified below.

Definition 1. Given two metric spaces (M_1, d_1) and (M_2, d_2), each consisting of a space and a distance metric, a function f : M_1 → M_2 is Lipschitz continuous (sometimes simply Lipschitz) if the Lipschitz constant, defined as

    K_{d_1, d_2}(f) := \sup_{s_1, s_2 \in M_1} \frac{d_2(f(s_1), f(s_2))}{d_1(s_1, s_2)},    (1)

is finite. Equivalently, for a Lipschitz f,

    \forall s_1, s_2 : \quad d_2(f(s_1), f(s_2)) \le K_{d_1, d_2}(f) \, d_1(s_1, s_2).

Note that the input and output of f can generally be scalars, vectors, or probability distributions. A Lipschitz function f is called a non-expansion when K_{d_1, d_2}(f) = 1 and a contraction when K_{d_1, d_2}(f) < 1. We also define Lipschitz continuity over a subset of inputs:

Definition 2. A function f : M_1 × A → M_2 is uniformly Lipschitz continuous in A if

    K^{A}_{d_1, d_2}(f) := \sup_{a \in A} \sup_{s_1, s_2} \frac{d_2(f(s_1, a), f(s_2, a))}{d_1(s_1, s_2)}    (2)

is finite. Note that the metric d_1 is still defined only on M_1.

Below we also present two useful lemmas.

Lemma 1 (Composition Lemma). Define three metric spaces (M_1, d_1), (M_2, d_2), and (M_3, d_3). Define Lipschitz functions f : M_2 → M_3 and g : M_1 → M_2 with constants K_{d_2, d_3}(f) and K_{d_1, d_2}(g). Then h := f ∘ g : M_1 → M_3 is Lipschitz with constant K_{d_1, d_3}(h) ≤ K_{d_2, d_3}(f) K_{d_1, d_2}(g).

Proof.

    K_{d_1, d_3}(h) = \sup_{s_1, s_2} \frac{d_3\big(f(g(s_1)), f(g(s_2))\big)}{d_1(s_1, s_2)}
                    = \sup_{s_1, s_2} \frac{d_3\big(f(g(s_1)), f(g(s_2))\big)}{d_2(g(s_1), g(s_2))} \cdot \frac{d_2(g(s_1), g(s_2))}{d_1(s_1, s_2)}
                    \le \sup_{s_1, s_2} \frac{d_3(f(s_1), f(s_2))}{d_2(s_1, s_2)} \cdot \sup_{s_1, s_2} \frac{d_2(g(s_1), g(s_2))}{d_1(s_1, s_2)}
                    = K_{d_2, d_3}(f) \, K_{d_1, d_2}(g).

Lemma 2 (Summation Lemma). Define two vector spaces (M_1, \|\cdot\|) and (M_2, \|\cdot\|). Define Lipschitz functions f : M_1 → M_2 and g : M_1 → M_2 with constants K_{\|\cdot\|,\|\cdot\|}(f) and K_{\|\cdot\|,\|\cdot\|}(g). Then h := f + g : M_1 → M_2 is Lipschitz with constant K_{\|\cdot\|,\|\cdot\|}(h) ≤ K_{\|\cdot\|,\|\cdot\|}(f) + K_{\|\cdot\|,\|\cdot\|}(g).

Proof.

    K_{\|\cdot\|, \|\cdot\|}(h) := \sup_{s_1, s_2} \frac{\| f(s_2) + g(s_2) - f(s_1) - g(s_1) \|}{\| s_2 - s_1 \|}
        \le \sup_{s_1, s_2} \frac{\| f(s_2) - f(s_1) \|}{\| s_2 - s_1 \|} + \sup_{s_1, s_2} \frac{\| g(s_2) - g(s_1) \|}{\| s_2 - s_1 \|}
        = K_{\|\cdot\|, \|\cdot\|}(f) + K_{\|\cdot\|, \|\cdot\|}(g).

2.3 Distance Between Distributions

We require a notion of difference between two distributions, quantified below.

Definition 3. Given a metric space (M, d) and the set P(M) of all probability measures on M, the Wasserstein metric (or the 1st Kantorovich metric) between two probability distributions μ_1 and μ_2 in P(M) is defined as

    W(\mu_1, \mu_2) := \inf_{j \in \Lambda} \int \int j(s_1, s_2) \, d(s_1, s_2) \, ds_2 \, ds_1,    (3)

where Λ denotes the collection of all joint distributions j on M × M with marginals μ_1 and μ_2 (Vaserstein, 1969).

Wasserstein is linked to Lipschitz continuity using duality:

    W(\mu_1, \mu_2) = \sup_{f : K_{d, d_R}(f) \le 1} \int \big( \mu_1(s) - \mu_2(s) \big) f(s) \, ds.    (4)

This equivalence is known as Kantorovich-Rubinstein duality (Kantorovich & Rubinstein, 1958; Villani, 2008). Sometimes referred to as Earth Mover's distance, Wasserstein has recently become popular in machine learning, namely in the context of generative adversarial networks (Arjovsky et al., 2017) and value distributions in reinforcement learning (Bellemare et al., 2017).

We also define the Kullback-Leibler divergence (simply KL) as an alternative measure of difference between two distributions:

    KL(\mu_1 \,\|\, \mu_2) := \int \mu_1(s) \log \frac{\mu_1(s)}{\mu_2(s)} \, ds.
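For intuition about these two notions of distance, the following sketch compares a pair of discrete distributions supported on points of the real line using both the 1-Wasserstein metric (via scipy.stats.wasserstein_distance) and KL. The particular support and weights are arbitrary illustrations, not quantities from the paper.

import numpy as np
from scipy.stats import wasserstein_distance

# Two discrete distributions mu1, mu2 on the same support points of the real line.
support = np.array([0.0, 1.0, 2.0, 3.0])
mu1 = np.array([0.1, 0.4, 0.4, 0.1])
mu2 = np.array([0.3, 0.2, 0.2, 0.3])

# 1-Wasserstein (Earth Mover's) distance with respect to |x - y| on the line.
w1 = wasserstein_distance(support, support, u_weights=mu1, v_weights=mu2)

# KL divergence KL(mu1 || mu2); well defined here because mu2 has full support.
kl = np.sum(mu1 * np.log(mu1 / mu2))

print(f"W(mu1, mu2)  = {w1:.4f}")
print(f"KL(mu1||mu2) = {kl:.4f}")

Unlike KL, the Wasserstein value depends on the geometry of the support: moving probability mass to a nearby point is cheaper than moving it far away.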

3 Value-Aware Model Learning (VAML) Loss

The basic idea behind VAML (Farahmand et al., 2017) is to learn a model tailored to the planning algorithm that intends to use it. Since Bellman equations (Bellman, 1957) are at the core of many RL algorithms (Sutton & Barto, 1998), we assume that the planner uses the following Bellman equation:

    Q(s, a) = R(s, a) + \gamma \int T(s' \mid s, a) \, f\big( Q(s', \cdot) \big) \, ds',

where f can generally be any arbitrary operator (Littman & Szepesvári, 1996), such as max. We also define:

    v(s) := f\big( Q(s, \cdot) \big).

A good model \hat{T} could then be thought of as one that minimizes the error

    l(T, \hat{T})(s, a) = \Big| R(s, a) + \gamma \int T(s' \mid s, a) v(s') \, ds' - R(s, a) - \gamma \int \hat{T}(s' \mid s, a) v(s') \, ds' \Big|
                        = \gamma \Big| \int \big( T(s' \mid s, a) - \hat{T}(s' \mid s, a) \big) v(s') \, ds' \Big|.

Note that minimizing this objective requires access to the value function in the first place, but we can obviate this need by leveraging Hölder's inequality:

    l(T, \hat{T})(s, a) = \gamma \Big| \int \big( T(s' \mid s, a) - \hat{T}(s' \mid s, a) \big) v(s') \, ds' \Big|
                        \le \gamma \, \big\| T(\cdot \mid s, a) - \hat{T}(\cdot \mid s, a) \big\|_1 \, \| v \|_\infty.

Further, we can use Pinsker's inequality to write:

    \big\| T(\cdot \mid s, a) - \hat{T}(\cdot \mid s, a) \big\|_1 \le \sqrt{ 2 \, KL\big( T(\cdot \mid s, a) \,\|\, \hat{T}(\cdot \mid s, a) \big) }.

This justifies the use of maximum-likelihood estimation for model learning, a common practice in model-based RL (Bagnell & Schneider, 2001; Abbeel et al., 2006; Agostini & Celaya, 2010), since maximum-likelihood estimation is equivalent to empirical KL minimization.

However, there exists a major drawback with the KL objective, namely that it ignores the structure of the value function during model learning. As a simple example, if the value function is constant throughout the state space, any randomly chosen model \hat{T} will in fact yield zero Bellman error. However, a model-learning algorithm that ignores the structure of the value function can potentially require many samples to provide any guarantee about the performance of the learned policy.

Consider the objective function l(T, \hat{T}), and notice again that v itself is not known, so we cannot directly optimize this objective. Farahmand et al. (2017) proposed to search for a model that results in the lowest error given all possible value functions belonging to a specific class F:

    L(T, \hat{T})(s, a) = \sup_{v \in F} \Big| \int \big( T(s' \mid s, a) - \hat{T}(s' \mid s, a) \big) v(s') \, ds' \Big|^2.    (5)

Minimizing this objective is shown to be tractable if, for example, F is restricted to the class of exponential functions.

Observe that the VAML objective (5) is similar to the dual form of Wasserstein (4); the main difference is the space of value functions. In the next section we show that even the spaces of value functions are the same under certain conditions.
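As a concrete illustration of objective (5), the sketch below evaluates the per-(s, a) error for a tabular model, approximating the supremum over F by a maximum over a small, hand-picked finite set of candidate value functions. The finite set and the random dynamics are illustrative assumptions; they are a crude stand-in for the function classes (e.g., exponential families) considered by Farahmand et al. (2017).

import numpy as np

def vaml_error(T, T_hat, value_fns, s, a):
    """max over a finite set of v of | sum_s' (T - T_hat)(s'|s,a) v(s') |^2."""
    diff = T[s, a] - T_hat[s, a]                  # (T - T_hat)(.|s,a), shape (n_states,)
    return max((diff @ v) ** 2 for v in value_fns)

# Arbitrary illustrative dynamics with the same shapes as the earlier MDP sketch.
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))      # "true" dynamics
T_hat = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # learned model

value_fns = [np.array([0.0, 1.0, 2.0]),   # a finite stand-in for the class F
             np.array([1.0, 1.0, 1.0]),   # a constant v contributes zero error
             np.array([2.0, 0.0, 1.0])]

print(vaml_error(T, T_hat, value_fns, s=0, a=1))

Note how the constant value function contributes nothing to the objective, matching the observation above that a constant v makes any model look perfect under the Bellman error.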

4 Lipschitz Generalized Value Iteration

We show that solving for a class of Bellman equations yields a Lipschitz value function. Our proof is in the context of GVI (Littman & Szepesvári, 1996), which defines Value Iteration (Bellman, 1957) with arbitrary backup operators. We make use of the following lemmas.

Lemma 3. Given a non-expansion f : S → R:

    K^{A}_{d_S, d_R}\Big( \int T(s' \mid s, a) f(s') \, ds' \Big) \le K^{A}_{d_S, W}(T).

Proof. Starting from the definition, we write:

    K^{A}_{d_S, d_R}\Big( \int T(s' \mid s, a) f(s') \, ds' \Big)
        = \sup_{a} \sup_{s_1, s_2} \frac{ \big| \int \big( T(s' \mid s_1, a) - T(s' \mid s_2, a) \big) f(s') \, ds' \big| }{ d_S(s_1, s_2) }
        \le \sup_{a} \sup_{s_1, s_2} \frac{ \sup_{g : K_{d_S, d_R}(g) \le 1} \big| \int \big( T(s' \mid s_1, a) - T(s' \mid s_2, a) \big) g(s') \, ds' \big| }{ d_S(s_1, s_2) }
        = \sup_{a} \sup_{s_1, s_2} \frac{ W\big( T(\cdot \mid s_1, a), T(\cdot \mid s_2, a) \big) }{ d_S(s_1, s_2) }
        = K^{A}_{d_S, W}(T).

Lemma 4. The following operators are non-expansions (K_{\|\cdot\|_\infty, d_R} = 1):

1. max(x) and mean(x);
2. ε-greedy(x) := ε mean(x) + (1 − ε) max(x);
3. mm_β(x) := log( (1/n) Σ_i e^{β x_i} ) / β.

Proof. (1) is proven by Littman & Szepesvári (1996). (2) follows from (1) (metrics not shown for brevity):

    K\big( \epsilon\text{-greedy}(x) \big) = K\big( \epsilon \, \text{mean}(x) + (1 - \epsilon) \, \text{max}(x) \big)
        \le \epsilon \, K\big( \text{mean}(x) \big) + (1 - \epsilon) \, K\big( \text{max}(x) \big) = 1.

Finally, (3) is proven multiple times in the literature (Asadi & Littman, 2017; Nachum et al., 2017; Neu et al., 2017).

Algorithm 1: GVI algorithm
Input: initial Q(s, a), threshold δ, and a choice of operator f
repeat
    diff ← 0
    for each s ∈ S do
        for each a ∈ A do
            Q_copy ← Q(s, a)
            Q(s, a) ← R(s, a) + γ ∫ T(s' | s, a) f( Q(s', ·) ) ds'
            diff ← max{ diff, |Q_copy − Q(s, a)| }
        end for
    end for
until diff < δ

We now present the main result of this paper.

Theorem 1. For any choice of backup operator f outlined in Lemma 4, GVI computes a value function with a Lipschitz constant bounded by K^{A}_{d_S, d_R}(R) / (1 − γ K^{A}_{d_S, W}(T)), provided that γ K^{A}_{d_S, W}(T) < 1.

Proof. From Algorithm 1, in the nth round of GVI updates we have:

    Q_{n+1}(s, a) \leftarrow R(s, a) + \gamma \int T(s' \mid s, a) \, f\big( Q_n(s', \cdot) \big) \, ds'.

First observe that:

    K^{A}_{d_S, d_R}(Q_{n+1})
        \le K^{A}_{d_S, d_R}(R) + \gamma \, K^{A}_{d_S, d_R}\Big( \int T(s' \mid s, a) f\big( Q_n(s', \cdot) \big) \, ds' \Big)    (Summation Lemma 2)
        \le K^{A}_{d_S, d_R}(R) + \gamma \, K^{A}_{d_S, W}(T) \, K_{d_S, d_R}\big( f( Q_n(s, \cdot) ) \big)    (Lemma 3)
        \le K^{A}_{d_S, d_R}(R) + \gamma \, K^{A}_{d_S, W}(T) \, K_{\|\cdot\|_\infty, d_R}(f) \, K^{A}_{d_S, d_R}(Q_n)    (Composition Lemma 1)
        = K^{A}_{d_S, d_R}(R) + \gamma \, K^{A}_{d_S, W}(T) \, K^{A}_{d_S, d_R}(Q_n).    (Lemma 4, the non-expansion property of f)

Equivalently, unrolling the recursion:

    K^{A}_{d_S, d_R}(Q_{n+1}) \le K^{A}_{d_S, d_R}(R) \sum_{i=0}^{n} \big( \gamma K^{A}_{d_S, W}(T) \big)^{i} + \big( \gamma K^{A}_{d_S, W}(T) \big)^{n} \, K^{A}_{d_S, d_R}(Q_0).

By taking the limit of both sides, we get:

    \lim_{n \to \infty} K^{A}_{d_S, d_R}(Q_{n+1})
        \le \lim_{n \to \infty} K^{A}_{d_S, d_R}(R) \sum_{i=0}^{n} \big( \gamma K^{A}_{d_S, W}(T) \big)^{i}
          + \lim_{n \to \infty} \big( \gamma K^{A}_{d_S, W}(T) \big)^{n} K^{A}_{d_S, d_R}(Q_0)
        = \frac{ K^{A}_{d_S, d_R}(R) }{ 1 - \gamma K^{A}_{d_S, W}(T) } + 0,

where we used the fact that \lim_{n \to \infty} ( \gamma K^{A}_{d_S, W}(T) )^{n} = 0. This concludes the proof.

Now notice that, as defined earlier, V_n(s) := f( Q_n(s, ·) ), so as a relevant corollary of our theorem we get:

    K_{d_S, d_R}(v) = \lim_{n \to \infty} K_{d_S, d_R}(V_n)
                    = \lim_{n \to \infty} K_{d_S, d_R}\big( f( Q_n(\cdot, \cdot) ) \big)
                    \le \lim_{n \to \infty} K^{A}_{d_S, d_R}(Q_n)
                    \le \frac{ K^{A}_{d_S, d_R}(R) }{ 1 - \gamma K^{A}_{d_S, W}(T) }.

That is, solving for the fixed point of this general class of Bellman equations results in a Lipschitz state-value function.
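Algorithm 1 translates almost directly into NumPy for a tabular MDP with the array shapes used in the earlier sketches. The version below is one possible translation (not the authors' code), using the max backup by default; any of the non-expansions from Lemma 4, such as mellowmax, can be passed in as f.

import numpy as np

def gvi(R, T, gamma, f=lambda q: q.max(axis=-1), delta=1e-8):
    """Generalized Value Iteration (Algorithm 1) for a tabular MDP.

    R: (n_states, n_actions) rewards, T: (n_states, n_actions, n_states) dynamics,
    f: backup operator applied to Q(s', .) for every s'.
    """
    Q = np.zeros_like(R)
    diff = np.inf
    while diff >= delta:
        v = f(Q)                     # v(s') = f(Q(s', .)), shape (n_states,)
        Q_new = R + gamma * T @ v    # Bellman backup for every (s, a) at once
        diff = np.abs(Q_new - Q).max()
        Q = Q_new
    return Q

def mellowmax(q, beta=5.0):
    """The mm_beta operator from Lemma 4, applied along the action axis."""
    n = q.shape[-1]
    return np.log(np.exp(beta * q).sum(axis=-1) / n) / beta

# Usage with the R, T, gamma arrays from the earlier sketch:
# Q_max = gvi(R, T, gamma)
# Q_mm  = gvi(R, T, gamma, f=mellowmax)

For large β or large Q values the exponential inside mellowmax can overflow; scipy.special.logsumexp is the numerically stable alternative.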

5 Equivalence Between VAML and Wasserstein

We now show the main claim of the paper, namely that minimizing the VAML objective is the same as minimizing the Wasserstein metric. Consider again the VAML objective:

    L(T, \hat{T})(s, a) = \sup_{v \in F} \Big| \int \big( T(s' \mid s, a) - \hat{T}(s' \mid s, a) \big) v(s') \, ds' \Big|^2,

where F can generally be any class of functions. From our theorem, however, the space of value functions F should be restricted to Lipschitz functions. Moreover, it is easy to design an MDP and a policy such that a desired Lipschitz value function is attained. This space L_C can then be defined as follows:

    L_C = \{ f : K_{d_S, d_R}(f) \le C \}, \quad \text{where} \quad C = \frac{ K^{A}_{d_S, d_R}(R) }{ 1 - \gamma K^{A}_{d_S, W}(T) }.

So we can rewrite the VAML objective L as follows:

    L(T, \hat{T})(s, a) = \sup_{f \in L_C} \Big| \int f(s') \big( T(s' \mid s, a) - \hat{T}(s' \mid s, a) \big) \, ds' \Big|^2
        = \sup_{f \in L_C} \Big| C \int \frac{f(s')}{C} \big( T(s' \mid s, a) - \hat{T}(s' \mid s, a) \big) \, ds' \Big|^2
        = C^2 \sup_{g \in L_1} \Big| \int g(s') \big( T(s' \mid s, a) - \hat{T}(s' \mid s, a) \big) \, ds' \Big|^2.

It is clear that a function g that maximizes the Kantorovich-Rubinstein dual form

    \sup_{g \in L_1} \int g(s') \big( T(s' \mid s, a) - \hat{T}(s' \mid s, a) \big) \, ds' = W\big( T(\cdot \mid s, a), \hat{T}(\cdot \mid s, a) \big)

will also maximize

    C^2 \sup_{g \in L_1} \Big| \int g(s') \big( T(s' \mid s, a) - \hat{T}(s' \mid s, a) \big) \, ds' \Big|^2.

This is due to the fact that g ∈ L_1 implies −g ∈ L_1, so taking the absolute value or squaring the term does not change the arg max in this case. As a result:

    L(T, \hat{T})(s, a) = C^2 \, W\big( T(\cdot \mid s, a), \hat{T}(\cdot \mid s, a) \big)^2.

This highlights a nice property of Wasserstein, namely that minimizing this metric yields a value-aware model.
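The final identity can be checked numerically in a simple one-dimensional case. The sketch below treats T(·|s,a) and T̂(·|s,a) as discrete distributions on points of the real line; in one dimension the supremum over C-Lipschitz functions is attained by a piecewise-linear witness whose slope on each interval is −C · sign(F_T − F_T̂), where F denotes the cumulative distribution function (a standard consequence of Kantorovich-Rubinstein duality, not a construction from the paper). The specific distributions and the value of C are arbitrary.

import numpy as np
from scipy.stats import wasserstein_distance

# Two next-state distributions T(.|s,a) and T_hat(.|s,a) on points of the real line.
support = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([0.1, 0.4, 0.4, 0.1])
t_hat = np.array([0.3, 0.2, 0.2, 0.3])
C = 2.5   # stand-in for the Lipschitz bound K(R) / (1 - gamma * K_W(T))

# Right-hand side: C^2 * W(T, T_hat)^2.
w = wasserstein_distance(support, support, u_weights=t, v_weights=t_hat)
rhs = C**2 * w**2

# Left-hand side: the VAML objective over C-Lipschitz functions, evaluated at the
# optimal piecewise-linear witness with slopes -C * sign(F_t - F_t_hat).
gaps = np.diff(support)
F_t, F_t_hat = np.cumsum(t)[:-1], np.cumsum(t_hat)[:-1]   # CDFs on interior intervals
slopes = -C * np.sign(F_t - F_t_hat)
f = np.concatenate([[0.0], np.cumsum(slopes * gaps)])     # f evaluated at the support
lhs = (f @ (t - t_hat)) ** 2

print(f"sup over C-Lipschitz f: {lhs:.6f}   C^2 * W^2: {rhs:.6f}")   # the two agree

Adding a constant to f does not change the integral against T − T̂, which is why the witness can be anchored at f(x_0) = 0.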

6 Conclusion and Future Work

We showed that the value function of an MDP is Lipschitz. This result enabled us to draw a connection between value-aware model-based reinforcement learning and the Wasserstein metric. We hypothesize that the value function is Lipschitz in a more general sense, so further investigation of the Lipschitz continuity of value functions should be interesting in its own right. A second interesting direction relates to the design of practical model-learning algorithms that can minimize Wasserstein. Two promising approaches are the use of generative adversarial networks (Goodfellow et al., 2014; Arjovsky et al., 2017) and approximations such as entropic regularization (Frogner et al., 2015). We leave these two directions for future work.

References

Abbeel, Pieter, Quigley, Morgan, and Ng, Andrew Y. Using inaccurate models in reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pp. 1-8. ACM, 2006.

Agostini, Alejandro and Celaya, Enric. Reinforcement learning with a Gaussian mixture model. In The 2010 International Joint Conference on Neural Networks (IJCNN), pp. 1-8. IEEE, 2010.

Arjovsky, Martin, Chintala, Soumith, and Bottou, Léon. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214-223, 2017.

Asadi, Kavosh and Littman, Michael L. An alternative softmax operator for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 243-252, 2017.

Asadi, Kavosh, Misra, Dipendra, and Littman, Michael L. Lipschitz continuity in model-based reinforcement learning. arXiv preprint, 2018.

Bagnell, J. Andrew and Schneider, Jeff G. Autonomous helicopter control using reinforcement learning policy search methods. In Proceedings of the 2001 IEEE International Conference on Robotics and Automation (ICRA), volume 2. IEEE, 2001.

Bellemare, Marc G., Dabney, Will, and Munos, Rémi. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, 2017.

Bellman, Richard. A Markovian decision process. Journal of Mathematics and Mechanics, 1957.

Farahmand, Amir-massoud, Barreto, Andre, and Nikovski, Daniel. Value-aware loss function for model-based reinforcement learning. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.

Frogner, Charlie, Zhang, Chiyuan, Mobahi, Hossein, Araya, Mauricio, and Poggio, Tomaso. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems, 2015.

Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.

Kaelbling, Leslie Pack, Littman, Michael L., and Moore, Andrew W. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.

Kantorovich, Leonid Vasilevich and Rubinstein, G. Sh. On a space of completely additive functions. Vestnik Leningrad University, 13(7):52-59, 1958.

Littman, Michael L. and Szepesvári, Csaba. A generalized reinforcement-learning model: Convergence and applications. In Proceedings of the 13th International Conference on Machine Learning, 1996.

Nachum, Ofir, Norouzi, Mohammad, Xu, Kelvin, and Schuurmans, Dale. Bridging the gap between value and policy based reinforcement learning. arXiv preprint, 2017.

Neu, Gergely, Jonsson, Anders, and Gómez, Vicenç. A unified view of entropy-regularized Markov decision processes. arXiv preprint, 2017.

Ross, Stéphane and Bagnell, Drew. Agnostic system identification for model-based reinforcement learning. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012), Edinburgh, Scotland, UK, 2012.

Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. The MIT Press, 1998.

Talvitie, Erik. Model regularization for stable sample rollouts. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence (UAI 2014), Quebec City, Quebec, Canada, 2014.

Vaserstein, Leonid Nisonovich. Markov processes over denumerable products of spaces, describing large systems of automata. Problemy Peredachi Informatsii, 5(3):64-72, 1969.

Venkatraman, Arun, Hebert, Martial, and Bagnell, J. Andrew. Improving multi-step prediction of learned time series models. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, Texas, USA, 2015.

Villani, Cédric. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.
