Bayesian Active Learning With Basis Functions

1 Bayesian Active Learning With Basis Functions
Ilya O. Ryzhov, Warren B. Powell
Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA
IEEE ADPRL, April 13

2 Outline
1 Introduction
2 Bayesian model with correlated beliefs
3 The knowledge gradient policy for exploration
4 Numerical results: storage example
5 Conclusions

4 Motivation: energy storage
We buy energy from the spot market when the price is low and sell when the price is high.
We store energy in a battery, which can be partially charged or discharged at will.
The price process is continuous and highly volatile (e.g. a mean-reverting SDE).

5 Motivation: energy storage
The state of the problem at time $n$ is $S^n = (R^n, P^n)$, where
  $R^n$ is the resource level, i.e. the amount of energy currently stored;
  $P^n$ is the current spot price of energy.
We make a decision $x^n$ representing how much to charge ($x^n \geq 0$) or discharge ($x^n < 0$).
The single-period revenue is $C(S^n, x^n) = -P^n x^n$: charging costs money and discharging earns it.
We maximize total discounted revenue over a large time horizon.

6 Approximate dynamic programming
We cannot solve Bellman's equation
$$V(S) = \max_x C(S,x) + \gamma\, \mathbb{E}\left[ V(S') \mid S, x \right]$$
for all states $S$, because the price process is continuous.
Instead, we compute an approximate observation
$$\hat{v}^n = \max_x C(S^n,x) + \gamma\, V^{n-1}\left( S^{M,x}(S^n,x) \right),$$
where $S^{M,x}$ maps $S^n$ and $x$ to the post-decision state $S^{x,n}$ (Van Roy et al. 1997).
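
The approximate observation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the finite decision grid, the helper names (`revenue`, `post_decision`, `approx_observation`) and the toy value function are all hypothetical, and the revenue is assumed to be $-P x$ so that charging costs money.

```python
from typing import Callable, Sequence, Tuple

State = Tuple[float, float]  # (resource level R, spot price P)


def revenue(state: State, x: float) -> float:
    """One-period revenue, assumed here to be C(S, x) = -P * x."""
    _, price = state
    return -price * x


def post_decision(state: State, x: float) -> State:
    """Post-decision state: the storage level changes, the price does not."""
    resource, price = state
    return (resource + x, price)


def approx_observation(
    state: State,
    decisions: Sequence[float],
    v_prev: Callable[[State], float],
    gamma: float,
) -> float:
    """Approximate observation: max over a finite grid of feasible decisions."""
    return max(
        revenue(state, x) + gamma * v_prev(post_decision(state, x))
        for x in decisions
    )


# Toy usage: with a value function that is zero everywhere, the best move
# is simply to sell one unit at the current price.
v_hat = approx_observation((1.0, 10.0), [-1.0, 0.0, 1.0], lambda s: 0.0, 0.9)
```

The caller is responsible for passing only feasible decisions (for example, ones that keep the storage level within the battery's capacity).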

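7 The post-decision state
[Diagram: $S^n \to S^{x,n} \to S^{n+1} \to S^{x,n+1}$, with the decision $x^n$ labeling the first arrow and the transition probability $P(S^{n+1} \mid S^{x,n})$ labeling the second.]
In our simple storage problem, the post-decision state is given by
$$R^{x,n} = R^n + x^n, \qquad P^{x,n} = P^n.$$
The transition from $P^{x,n}$ to $P^{n+1}$ comes from the price process.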
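
As a rough sketch of the transition out of the post-decision state, the following Python code advances the price by one Euler step of a mean-reverting process. The slides do not state the exact SDE or its parameters, so the Ornstein-Uhlenbeck form and the values of `mean`, `speed`, `vol` and `dt` are assumptions chosen only for illustration.

```python
import math
import random
from typing import Tuple

State = Tuple[float, float]  # (resource level R, spot price P)


def next_price(
    price: float,
    rng: random.Random,
    mean: float = 50.0,   # hypothetical long-run price level
    speed: float = 0.2,   # hypothetical mean-reversion speed
    vol: float = 5.0,     # hypothetical volatility
    dt: float = 1.0,      # hypothetical time step
) -> float:
    """One Euler step of dP = speed * (mean - P) dt + vol dW."""
    shock = vol * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return price + speed * (mean - price) * dt + shock


def simulate_transition(post_state: State, rng: random.Random, **params: float) -> State:
    """Move from a post-decision state to the next pre-decision state.

    Only the price is random; the storage level was already fixed by the decision.
    """
    resource, price = post_state
    return (resource, next_price(price, rng, **params))


rng = random.Random(0)
next_state = simulate_transition((3.0, 40.0), rng)
```

Note that this simple form can produce negative prices; a real study would use whatever price model the application calls for.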

8 The exploration/exploitation dilemma
Exploitation: the decision
$$x^n = \arg\max_x C(S^n,x) + \gamma\, V^{n-1}(S^{x,n})$$
seems to be the best under the current value function approximation (VFA).
Exploration: because the VFA is inaccurate, we might make a different decision in the hope of discovering a better strategy.
Example: if we start in state 0 with $V^{n-1} = 0$ for both states, a pure exploitation strategy leads us to stay in state 0 forever.

9 Basis functions
We use a parametric approximation
$$V(S^x) = \sum_{i=1}^{F} \theta_i \phi_i(S^x),$$
where the $\phi_i$ are basis functions of the post-decision state:
$$\phi(S^x) = \left( 1,\ R^x,\ (R^x)^2,\ P^x,\ (P^x)^2,\ R^x P^x \right)^T.$$
The problem is reduced: we only need to fit a finite set of parameters.
A single observation now changes our beliefs about the entire value surface.
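
The quadratic basis listed on this slide is easy to write down directly. The sketch below is a minimal Python rendering; the function names `basis` and `value` are hypothetical.

```python
from typing import List, Sequence


def basis(resource: float, price: float) -> List[float]:
    """Feature vector (1, R, R^2, P, P^2, R*P) of a post-decision state."""
    return [1.0, resource, resource ** 2, price, price ** 2, resource * price]


def value(theta: Sequence[float], resource: float, price: float) -> float:
    """Linear value function approximation: the inner product theta . phi."""
    return sum(t * f for t, f in zip(theta, basis(resource, price)))


# Toy usage: a parameter vector that only weights the constant and R*P terms.
theta = [1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
v = value(theta, 2.0, 3.0)
```

Because every state shares the same six parameters, changing any one of them shifts the estimated value of all states at once.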

11 Correlated beliefs using multivariate normal priors
Correlations are modeled using a covariance matrix.
We assume that $\theta \sim N(\theta^0, C^0)$.
This induces a prior on the value function:
$$\mathbb{E}\, V(S^x) = (\theta^0)^T \phi(S^x),$$
$$\mathrm{Cov}\left( V(S^x), V(S^{x'}) \right) = \phi(S^x)^T C^0 \phi(S^{x'}).$$
The quantity $V^0(S^x) = \mathbb{E}\, V(S^x)$ represents our initial estimate of the value of $S^x$.
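
To make the induced prior concrete, here is a small pure-Python sketch that evaluates the prior mean of one state's value and the prior covariance between the values of two states. The helper names and the two-feature toy numbers are assumptions for illustration only.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mat_vec(m: Sequence[Sequence[float]], v: Sequence[float]) -> List[float]:
    """Matrix-vector product."""
    return [dot(row, v) for row in m]


def prior_mean(theta0: Sequence[float], phi: Sequence[float]) -> float:
    """Prior mean of V at a state with feature vector phi."""
    return dot(theta0, phi)


def prior_cov(
    phi_a: Sequence[float],
    c0: Sequence[Sequence[float]],
    phi_b: Sequence[float],
) -> float:
    """Prior covariance between V at two states: phi_a^T C0 phi_b."""
    return dot(phi_a, mat_vec(c0, phi_b))


# Toy usage with two features and a diagonal prior covariance.
c0 = [[1.0, 0.0], [0.0, 4.0]]
cov_ab = prior_cov([1.0, 2.0], c0, [1.0, -1.0])
mean_a = prior_mean([0.5, 2.0], [1.0, 2.0])
```

Even with a diagonal covariance on the parameters, the values of two different states are correlated here, because they share the same features.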

12 Main assumption
Assumption 1 (Dearden et al. 1998): the ADP observation $\hat{v}^{n+1}$ follows the distribution $N\left( V(S^{x,n}), \sigma_\varepsilon^2 \right)$ and is independent of past observations.
This is a standard assumption in optimal learning (e.g. bandit problems), but it does not hold in ADP.
The observation $\hat{v}^{n+1}$ is biased by definition:
$$\hat{v}^{n+1} = \max_x C(S^{n+1},x) + \gamma\, V^n\left( S^{M,x}(S^{n+1},x) \right).$$

13 Learning with correlated beliefs
Once we choose action $x$, Assumption 1 allows us to update
$$\theta^{n+1} = \theta^n + \frac{\hat{v}^{n+1} - \phi(S^{x,n})^T \theta^n}{\sigma_\varepsilon^2 + \phi(S^{x,n})^T C^n \phi(S^{x,n})}\, C^n \phi(S^{x,n}),$$
$$C^{n+1} = C^n - \frac{C^n \phi(S^{x,n})\, \phi(S^{x,n})^T C^n}{\sigma_\varepsilon^2 + \phi(S^{x,n})^T C^n \phi(S^{x,n})},$$
resulting in a new induced prior on the value function, $V^{n+1}(S^x) = \phi(S^x)^T \theta^{n+1}$.
Even if $C^0$ is a diagonal matrix, the updating equations will create empirical covariances.
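
The two updating equations amount to a standard recursive Bayesian linear regression step. The pure-Python sketch below is one way to write it, assuming the covariance matrix is symmetric (so that $C\phi\phi^T C$ equals the outer product of $C\phi$ with itself); the function name `update_beliefs` and the toy numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def update_beliefs(
    theta: Sequence[float],
    cov: Sequence[Sequence[float]],
    phi: Sequence[float],
    v_hat: float,
    noise_var: float,
) -> Tuple[List[float], List[List[float]]]:
    """One Bayesian update of (theta, C) after observing v_hat at features phi.

    Assumes cov is symmetric, so C phi phi^T C is the outer product of C phi.
    """
    size = len(theta)
    c_phi = [sum(cov[i][j] * phi[j] for j in range(size)) for i in range(size)]
    denom = noise_var + sum(phi[i] * c_phi[i] for i in range(size))
    error = v_hat - sum(phi[i] * theta[i] for i in range(size))
    new_theta = [theta[i] + error / denom * c_phi[i] for i in range(size)]
    new_cov = [
        [cov[i][j] - c_phi[i] * c_phi[j] / denom for j in range(size)]
        for i in range(size)
    ]
    return new_theta, new_cov


# Toy usage: start from a diagonal covariance and observe one value.
theta1, cov1 = update_beliefs(
    theta=[0.0, 0.0],
    cov=[[1.0, 0.0], [0.0, 1.0]],
    phi=[1.0, 1.0],
    v_hat=3.0,
    noise_var=1.0,
)
```

In this toy case the off-diagonal entries of the updated covariance become nonzero after a single observation, which is the effect described on the slide.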

15 The knowledge gradient policy
Previously studied in ranking and selection (Gupta & Miescke 1996) and multi-armed bandits (Ryzhov et al. 2010).
One-period look-ahead policy: how much will the next decision improve our expected objective value?
KG decision rule:
$$x^{KG,n} = \arg\max_x C(S^n,x) + \gamma\, \mathbb{E}^n_x V^{n+1}(S^{x,n})$$
We look one step ahead: on average, what will the VFA become after the next decision?

16 Intuition behind KG idea
[Diagram, built up over several slides, with labels $S^n$, $C(S^n,x)$, $\hat{v}^{n+1}$, $\theta^n$ and $\theta^{n+1}$.]
Can we account for the change from $\theta^n$ to $\theta^{n+1}$ before we choose to go to $S^{x,n}$?

22 Derivation
KG decision rule:
$$x^{KG,n} = \arg\max_x C(S^n,x) + \gamma\, \mathbb{E}^n_x V^{n+1}(S^{x,n})$$
We can expand
$$\mathbb{E}^n_x V^{n+1}(S^{x,n}) = \sum_{S^{n+1}} P(S^{n+1} \mid S^{x,n})\, \mathbb{E}^n_x \max_{x'} Q^{n+1}(S^{n+1},x'),$$
where
$$Q^{n+1}(S^{n+1},x') = C(S^{n+1},x') + \gamma\, \phi(S^{x',n+1})^T \theta^{n+1}.$$

24 Derivation
Bayesian analysis (Frazier et al. 2009) shows that the conditional distribution of $\theta^{n+1}$, given decision $x$, is
$$\theta^{n+1} \sim \theta^n + \frac{C^n \phi(S^{x,n})}{\sqrt{\sigma_\varepsilon^2 + \phi(S^{x,n})^T C^n \phi(S^{x,n})}}\, Z,$$
where $Z \sim N(0,1)$.
This $Z$ is not indexed by $S^{n+1}$, yielding
$$\mathbb{E}^n_x \max_{x'} Q^{n+1}(S^{n+1},x') = \mathbb{E} \max_{x'} \left( a^n_{x'} + b^n_{x'} Z \right).$$
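
The expectation of a maximum of finitely many lines in a standard normal $Z$ can be evaluated exactly. The sketch below is one pure-Python way to do it, not code from the talk or from Frazier et al.: it keeps only the lines on the upper envelope and integrates each piece against the normal density. Treating the value of information as the excess of this expectation over $\max_{x'} a_{x'}$ is an assumption made here, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def norm_pdf(z: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z: float) -> float:
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_max_affine(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact E[max_i (a_i + b_i * Z)] for Z ~ N(0, 1)."""
    # Among lines with equal slope, only the largest intercept can be maximal.
    best: Dict[float, float] = {}
    for intercept, slope in zip(a, b):
        if slope not in best or intercept > best[slope]:
            best[slope] = intercept
    lines = sorted(best.items())  # (slope, intercept), increasing slope

    # Build the upper envelope; starts[i] is where hull[i] becomes maximal.
    hull: List[Tuple[float, float]] = []
    starts: List[float] = []
    for slope, intercept in lines:
        start = -math.inf
        while hull:
            top_slope, top_intercept = hull[-1]
            start = (top_intercept - intercept) / (slope - top_slope)
            if start > starts[-1]:
                break
            hull.pop()      # the top line is never maximal: drop it
            starts.pop()
        hull.append((slope, intercept))
        starts.append(start)

    # Integrate each envelope piece: the integral of z * pdf(z) is -pdf(z).
    ends = starts[1:] + [math.inf]
    total = 0.0
    for (slope, intercept), lo, hi in zip(hull, starts, ends):
        total += intercept * (norm_cdf(hi) - norm_cdf(lo))
        total += slope * (norm_pdf(lo) - norm_pdf(hi))
    return total


def kg_value(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed value of information: E[max(a + b Z)] minus max(a)."""
    return expected_max_affine(a, b) - max(a)


# Toy usage: max(-Z, Z) = |Z|, whose mean is sqrt(2 / pi).
example = expected_max_affine([0.0, 0.0], [-1.0, 1.0])
```

When all slopes are equal, nothing can be learned that changes the ranking of decisions, and the assumed value of information is zero.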

25 Final form of knowledge gradient policy
This expectation can be computed, and the KG decision becomes
$$x^{KG,n} = \arg\max_x C(S^n,x) + \gamma\, \phi(S^{x,n})^T \theta^n + \gamma \sum_{S^{n+1}} P(S^{n+1} \mid S^{x,n})\, \nu^{KG,n}(S^{x,n}, S^{n+1}).$$
This is Bellman's equation plus a KG term representing the value of information.
The transition probabilities are difficult to compute, but we can simulate $K$ transitions from $S^{x,n}$ to $S^{n+1}$ and approximate
$$\sum_{S^{n+1}} P(S^{n+1} \mid S^{x,n})\, \nu^{KG,n}(S^{x,n}, S^{n+1}) \approx \frac{1}{K} \sum_{k=1}^{K} \nu^{KG,n}(S^{x,n}, S^{n+1}_k).$$
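
The sampled version of the decision rule can be sketched as a loop over candidate decisions. This is a schematic under stated assumptions rather than the authors' code: the model is passed in as callables (`reward`, `post_decision`, `value_mean`, `simulate_next`, `kg_term`), all of which are hypothetical names, and the decision set is assumed to be a finite grid.

```python
import math
import random
from typing import Callable, Sequence, Tuple

State = Tuple[float, float]  # (resource level R, spot price P)


def kg_decision(
    state: State,
    decisions: Sequence[float],
    reward: Callable[[State, float], float],
    post_decision: Callable[[State, float], State],
    value_mean: Callable[[State], float],
    simulate_next: Callable[[State, random.Random], State],
    kg_term: Callable[[State, State], float],
    gamma: float,
    num_samples: int,
    rng: random.Random,
) -> float:
    """Pick the decision maximizing reward + discounted value + discounted KG term.

    The sum over next states is replaced by an average over simulated transitions.
    """
    best_x = decisions[0]
    best_score = -math.inf
    for x in decisions:
        post = post_decision(state, x)
        samples = [simulate_next(post, rng) for _ in range(num_samples)]
        info = sum(kg_term(post, nxt) for nxt in samples) / num_samples
        score = reward(state, x) + gamma * value_mean(post) + gamma * info
        if score > best_score:
            best_x, best_score = x, score
    return best_x


# Toy usage: no reward and a flat value estimate, so the decision is driven
# purely by a made-up KG term that grows with the storage level.
choice = kg_decision(
    state=(1.0, 5.0),
    decisions=[-1.0, 0.0, 1.0],
    reward=lambda s, x: 0.0,
    post_decision=lambda s, x: (s[0] + x, s[1]),
    value_mean=lambda post: 0.0,
    simulate_next=lambda post, r: post,
    kg_term=lambda post, nxt: post[0],
    gamma=0.9,
    num_samples=5,
    rng=random.Random(0),
)
```

Passing zero functions for `reward` and `value_mean`, as in the toy call, leaves only the value-of-information term in the score.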

27 Setup of experiments
Two-dimensional state variable: $S^n = (R^n, P^n)$.
The price variable evolves according to a mean-reverting SDE.
The state is continuous, but we discretize the VFA.
Two performance measures:
  Online: how much revenue did we collect while training the VFA?
  Offline: how much revenue can we collect after the VFA is fixed?
Two forms of KG:
  Online: maximize Bellman's equation plus the value of information.
  Offline: maximize the value of information only.

28 Numerical results
Columns: Offline objective (Mean, Avg. SE); Online objective (Mean, Avg. SE).
Rows, lookup VFA: Offline KG, Online KG, BQL, Bayes EG, ADP EG.
Rows, basis VFA: Offline KG, Online KG, Basis EG.
[Numerical entries not preserved in this transcript.]
Table: Online and offline values (averaged over 1000 sample paths) achieved by different methods in 150 iterations.

29 Online performance of KG with basis functions
Early observations drastically change the VFA, leading to volatile online results.
However, as the VFA improves, results get better quickly.

30 Policy produced by offline KG
After 150 iterations, the final VFA produces a "buy low, sell high" policy.

31 Implementation issues
We can obtain a very good fit, but we need to pick the problem parameters ($\sigma_\varepsilon^2$ and $C^0$) carefully.
Some parameter values can cause the estimates $\theta^n$ to diverge.
This is a property of basis functions, not of KG.
A comment on basis functions (Sutton et al. 2008):
"Counterexamples have been known for many years ... in which Q-learning's parameters diverge to infinity for any positive step size ... the need [to handle large state spaces] is so great that practitioners have often simply ignored the problem and continued to use Q-learning with linear function approximation anyway."

33 Conclusions
We have proposed a new policy for making decisions in ADP with parametric value function approximations.
The policy uses a Bayesian belief structure to calculate the value of information obtained from a single decision.
We use the knowledge gradient approach to balance exploration and exploitation in both online and offline settings.
The policy produces very good fits, but requires tuning.

34 References
Dearden, R., Friedman, N. & Russell, S. (1998) Bayesian Q-learning. Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence.
Frazier, P.I., Powell, W. & Dayanik, S. (2009) The knowledge gradient policy for correlated normal rewards. INFORMS J. on Computing 21:4.
Gupta, S. & Miescke, K. (1996) Bayesian look ahead one stage sampling allocation for selecting the best population. J. on Statistical Planning and Inference 54.

35 References
Ryzhov, I.O., Powell, W. & Frazier, P.I. (2010) The knowledge gradient algorithm for a general class of online learning problems. Submitted to Operations Research.
Sutton, R., Szepesvári, C. & Maei, H. (2008) A convergent O(n) algorithm for off-policy temporal difference learning with linear function approximation. Advances in Neural Information Processing Systems 21.
Van Roy, B., Bertsekas, D., Lee, Y. & Tsitsiklis, J. (1997) A neuro-dynamic programming approach to retailer inventory management. Proceedings of the 36th IEEE Conference on Decision and Control.


More information

Dual Temporal Difference Learning

Dual Temporal Difference Learning Dual Temporal Difference Learning Min Yang Dept. of Computing Science University of Alberta Edmonton, Alberta Canada T6G 2E8 Yuxi Li Dept. of Computing Science University of Alberta Edmonton, Alberta Canada

More information

Generalization and Function Approximation

Generalization and Function Approximation Generalization and Function Approximation 0 Generalization and Function Approximation Suggested reading: Chapter 8 in R. S. Sutton, A. G. Barto: Reinforcement Learning: An Introduction MIT Press, 1998.

More information

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it

More information

Q-Learning in Continuous State Action Spaces

Q-Learning in Continuous State Action Spaces Q-Learning in Continuous State Action Spaces Alex Irpan alexirpan@berkeley.edu December 5, 2015 Contents 1 Introduction 1 2 Background 1 3 Q-Learning 2 4 Q-Learning In Continuous Spaces 4 5 Experimental

More information

Model-Based Reinforcement Learning with Continuous States and Actions

Model-Based Reinforcement Learning with Continuous States and Actions Marc P. Deisenroth, Carl E. Rasmussen, and Jan Peters: Model-Based Reinforcement Learning with Continuous States and Actions in Proceedings of the 16th European Symposium on Artificial Neural Networks

More information

Approximate Universal Artificial Intelligence

Approximate Universal Artificial Intelligence Approximate Universal Artificial Intelligence A Monte-Carlo AIXI Approximation Joel Veness Kee Siong Ng Marcus Hutter David Silver University of New South Wales National ICT Australia The Australian National

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Model-Based Reinforcement Learning Model-based, PAC-MDP, sample complexity, exploration/exploitation, RMAX, E3, Bayes-optimal, Bayesian RL, model learning Vien Ngo MLR, University

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Uncertainty & Probabilities & Bandits Daniel Hennes 16.11.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Uncertainty Probability

More information

Multi-Attribute Bayesian Optimization under Utility Uncertainty

Multi-Attribute Bayesian Optimization under Utility Uncertainty Multi-Attribute Bayesian Optimization under Utility Uncertainty Raul Astudillo Cornell University Ithaca, NY 14853 ra598@cornell.edu Peter I. Frazier Cornell University Ithaca, NY 14853 pf98@cornell.edu

More information

Human-level control through deep reinforcement. Liia Butler

Human-level control through deep reinforcement. Liia Butler Humanlevel control through deep reinforcement Liia Butler But first... A quote "The question of whether machines can think... is about as relevant as the question of whether submarines can swim" Edsger

More information

Online Learning and Sequential Decision Making

Online Learning and Sequential Decision Making Online Learning and Sequential Decision Making Emilie Kaufmann CNRS & CRIStAL, Inria SequeL, emilie.kaufmann@univ-lille.fr Research School, ENS Lyon, Novembre 12-13th 2018 Emilie Kaufmann Sequential Decision

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Master MVA: Reinforcement Learning Lecture: 5 Approximate Dynamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of the lecture 1.

More information

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning CSE 473: Artificial Intelligence Reinforcement Learning Dan Weld Today s Outline Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Many slides

More information

A Convergent O(n) Algorithm for Off-policy Temporal-difference Learning with Linear Function Approximation

A Convergent O(n) Algorithm for Off-policy Temporal-difference Learning with Linear Function Approximation A Convergent O(n) Algorithm for Off-policy Temporal-difference Learning with Linear Function Approximation Richard S. Sutton, Csaba Szepesvári, Hamid Reza Maei Reinforcement Learning and Artificial Intelligence

More information

Optimal Learning with Non-Gaussian Rewards. Zi Ding

Optimal Learning with Non-Gaussian Rewards. Zi Ding ABSTRACT Title of dissertation: Optimal Learning with Non-Gaussian Rewards Zi Ding, Doctor of Philosophy, 214 Dissertation directed by: Professor Ilya O. Ryzhov Robert. H. Smith School of Business In this

More information

A Gentle Introduction to Reinforcement Learning

A Gentle Introduction to Reinforcement Learning A Gentle Introduction to Reinforcement Learning Alexander Jung 2018 1 Introduction and Motivation Consider the cleaning robot Rumba which has to clean the office room B329. In order to keep things simple,

More information

Decision Theory: Q-Learning

Decision Theory: Q-Learning Decision Theory: Q-Learning CPSC 322 Decision Theory 5 Textbook 12.5 Decision Theory: Q-Learning CPSC 322 Decision Theory 5, Slide 1 Lecture Overview 1 Recap 2 Asynchronous Value Iteration 3 Q-Learning

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Reinforcement learning Daniel Hennes 4.12.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Reinforcement learning Model based and

More information

State Space Abstractions for Reinforcement Learning

State Space Abstractions for Reinforcement Learning State Space Abstractions for Reinforcement Learning Rowan McAllister and Thang Bui MLG RCC 6 November 24 / 24 Outline Introduction Markov Decision Process Reinforcement Learning State Abstraction 2 Abstraction

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Function approximation Mario Martin CS-UPC May 18, 2018 Mario Martin (CS-UPC) Reinforcement Learning May 18, 2018 / 65 Recap Algorithms: MonteCarlo methods for Policy Evaluation

More information

MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti

MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti 1 MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti Historical background 2 Original motivation: animal learning Early

More information

Tutorial: Stochastic Optimization in Energy

Tutorial: Stochastic Optimization in Energy Tutorial: Stochastic Optimization in Energy FERC, Washington, D.C. August 6, 2014 Warren B. Powell CASTLE Labs Princeton University http://www.castlelab.princeton.edu Warren B. Powell, 2014 Slide 1 Mission

More information

Lecture 2: Learning from Evaluative Feedback. or Bandit Problems

Lecture 2: Learning from Evaluative Feedback. or Bandit Problems Lecture 2: Learning from Evaluative Feedback or Bandit Problems 1 Edward L. Thorndike (1874-1949) Puzzle Box 2 Learning by Trial-and-Error Law of Effect: Of several responses to the same situation, those

More information

Q-Learning and SARSA: Machine learningbased stochastic control approaches for financial trading

Q-Learning and SARSA: Machine learningbased stochastic control approaches for financial trading Q-Learning and SARSA: Machine learningbased stochastic control approaches for financial trading Marco CORAZZA (corazza@unive.it) Department of Economics Ca' Foscari University of Venice CONFERENCE ON COMPUTATIONAL

More information

Optimal Control. McGill COMP 765 Oct 3 rd, 2017

Optimal Control. McGill COMP 765 Oct 3 rd, 2017 Optimal Control McGill COMP 765 Oct 3 rd, 2017 Classical Control Quiz Question 1: Can a PID controller be used to balance an inverted pendulum: A) That starts upright? B) That must be swung-up (perhaps

More information

Lecture 10 - Planning under Uncertainty (III)

Lecture 10 - Planning under Uncertainty (III) Lecture 10 - Planning under Uncertainty (III) Jesse Hoey School of Computer Science University of Waterloo March 27, 2018 Readings: Poole & Mackworth (2nd ed.)chapter 12.1,12.3-12.9 1/ 34 Reinforcement

More information

Biasing Approximate Dynamic Programming with a Lower Discount Factor

Biasing Approximate Dynamic Programming with a Lower Discount Factor Biasing Approximate Dynamic Programming with a Lower Discount Factor Marek Petrik, Bruno Scherrer To cite this version: Marek Petrik, Bruno Scherrer. Biasing Approximate Dynamic Programming with a Lower

More information

Reinforcement Learning in Partially Observable Multiagent Settings: Monte Carlo Exploring Policies

Reinforcement Learning in Partially Observable Multiagent Settings: Monte Carlo Exploring Policies Reinforcement earning in Partially Observable Multiagent Settings: Monte Carlo Exploring Policies Presenter: Roi Ceren THINC ab, University of Georgia roi@ceren.net Prashant Doshi THINC ab, University

More information

Parameter Estimation in the Spatio-Temporal Mixed Effects Model Analysis of Massive Spatio-Temporal Data Sets

Parameter Estimation in the Spatio-Temporal Mixed Effects Model Analysis of Massive Spatio-Temporal Data Sets Parameter Estimation in the Spatio-Temporal Mixed Effects Model Analysis of Massive Spatio-Temporal Data Sets Matthias Katzfuß Advisor: Dr. Noel Cressie Department of Statistics The Ohio State University

More information

Symbolic Perseus: a Generic POMDP Algorithm with Application to Dynamic Pricing with Demand Learning

Symbolic Perseus: a Generic POMDP Algorithm with Application to Dynamic Pricing with Demand Learning Symbolic Perseus: a Generic POMDP Algorithm with Application to Dynamic Pricing with Demand Learning Pascal Poupart (University of Waterloo) INFORMS 2009 1 Outline Dynamic Pricing as a POMDP Symbolic Perseus

More information

Introduction to Reinforcement Learning

Introduction to Reinforcement Learning Introduction to Reinforcement Learning Rémi Munos SequeL project: Sequential Learning http://researchers.lille.inria.fr/ munos/ INRIA Lille - Nord Europe Machine Learning Summer School, September 2011,

More information