Bayesian Active Learning With Basis Functions
Bayesian Active Learning With Basis Functions
Ilya O. Ryzhov and Warren B. Powell
Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA
IEEE ADPRL, April 13
Outline
1. Introduction
2. Bayesian model with correlated beliefs
3. The knowledge gradient policy for exploration
4. Numerical results: storage example
5. Conclusions
Motivation: energy storage
- We buy energy from the spot market when the price is low and sell when the price is high.
- We store energy in a battery, which can be partially charged or discharged at will.
- The price process is continuous and highly volatile (e.g. a mean-reverting SDE).
Motivation: energy storage
The state of the problem at time $n$ is $S^n = (R^n, P^n)$, where
- $R^n$ is the resource level, the amount of energy currently stored;
- $P^n$ is the current spot price of energy.
We make a decision $x^n$ representing how much to charge ($x^n \geq 0$) or discharge ($x^n < 0$). The single-period revenue is $C(S^n, x^n) = P^n x^n$. We are maximizing total discounted revenue over a long time horizon.
Approximate dynamic programming
We cannot solve Bellman's equation
$$V(S) = \max_x C(S,x) + \gamma\, \mathbb{E}\left[ V(S') \mid S \right]$$
for all states, because the price process is continuous. Instead, we compute an approximate observation
$$\hat{v}^n = \max_x C(S^n, x) + \gamma\, V^{n-1}\!\left(S^{M,x}(S^n, x)\right),$$
where $S^{M,x}$ maps $S^n$ and $x$ to the post-decision state $S^{x,n}$ (Van Roy et al. 1997).
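As a sketch of how such an approximate observation could be computed over a discretized action set (the callables `C`, `V_prev`, and `post_state` are illustrative placeholders, not the authors' implementation):

```python
import numpy as np

def adp_observation(S, actions, C, V_prev, post_state, gamma=0.99):
    """Approximate Bellman observation:
    v_hat = max_x [ C(S, x) + gamma * V_prev(post_state(S, x)) ].

    C(S, x)          -- single-period revenue
    V_prev(Sx)       -- current value function approximation at the post-decision state
    post_state(S, x) -- maps (S, x) to the post-decision state S^x
    All three callables are hypothetical placeholders supplied by the user.
    """
    values = [C(S, x) + gamma * V_prev(post_state(S, x)) for x in actions]
    best = int(np.argmax(values))
    return values[best], actions[best]
```

The returned pair gives both the observation $\hat{v}^n$ and the maximizing action, which a pure-exploitation policy would then execute.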
The post-decision state
The decision $x^n$ takes us from $S^n$ to the post-decision state $S^{x,n}$; the transition $P(S^{n+1} \mid S^{x,n})$ then takes us to $S^{n+1}$. In our simple storage problem, the post-decision state is given by
$$R^{x,n} = R^n + x^n, \qquad P^{x,n} = P^n.$$
The transition from $P^{x,n}$ to $P^{n+1}$ comes from the price process.
The exploration/exploitation dilemma
Exploitation: the decision
$$x^n = \arg\max_x C(S^n, x) + \gamma\, V^{n-1}(S^{x,n})$$
seems best under the current VFA.
Exploration: because the VFA is inaccurate, we might make a different decision, in the hope of discovering a better strategy.
Example: if we start in state 0 with $V^{n-1} = 0$ for both states, a pure exploitation strategy leads us to stay in state 0 forever.
Basis functions
We use a parametric approximation
$$V(S^x) = \sum_{i=1}^F \theta_i\, \phi_i(S^x),$$
where the $\phi_i$ are basis functions of the post-decision state:
$$\phi(S^x) = \left(1,\, R^x,\, (R^x)^2,\, P^x,\, (P^x)^2,\, R^x P^x\right)^T.$$
The problem is reduced: we only need to fit a finite set of parameters. A single observation now changes our beliefs about the entire value surface.
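The feature vector above and the resulting linear VFA can be written in a few lines (a minimal sketch; the function names are ours, not from the talk):

```python
import numpy as np

def basis(Sx):
    """Quadratic basis functions of the post-decision state Sx = (R, P):
    phi(Sx) = (1, R, R^2, P, P^2, R*P)^T, as on the slide."""
    R, P = Sx
    return np.array([1.0, R, R**2, P, P**2, R * P])

def vfa(theta, Sx):
    """Parametric value function approximation V(Sx) = theta^T phi(Sx)."""
    return float(theta @ basis(Sx))
```

Because every state shares the same parameter vector `theta`, updating `theta` after one observation moves the estimated value of all states at once.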
Correlated beliefs using multivariate normal priors
Correlations are modeled using a covariance matrix: we assume that $\theta \sim N(\theta^0, C^0)$. This induces a prior on the value function:
$$\mathbb{E}\, V(S^x) = (\theta^0)^T \phi(S^x), \qquad \mathrm{Cov}\!\left(V(S^x), V(S^{x'})\right) = \phi(S^x)^T C^0 \phi(S^{x'}).$$
The quantity $V^0(S^x) = \mathbb{E}\, V(S^x)$ represents our initial estimate of the value of $S^x$.
Main assumption
Assumption 1 (Dearden et al. 1998). The ADP observation $\hat{v}^{n+1}$ follows the distribution $N\!\left(V(S^{x,n}), \sigma_\varepsilon^2\right)$ and is independent of past observations.
This is a standard assumption in optimal learning (e.g. bandit problems), but it does not hold in ADP: the observation $\hat{v}^{n+1}$ is biased by definition,
$$\hat{v}^{n+1} = \max_x C(S^{n+1}, x) + \gamma\, V^n\!\left(S^{M,x}(S^{n+1}, x)\right).$$
Learning with correlated beliefs
Once we choose action $x$, Assumption 1 allows us to update
$$\theta^{n+1} = \theta^n + \frac{\hat{v}^{n+1} - \phi(S^{x,n})^T \theta^n}{\sigma_\varepsilon^2 + \phi(S^{x,n})^T C^n \phi(S^{x,n})}\, C^n \phi(S^{x,n}),$$
$$C^{n+1} = C^n - \frac{C^n \phi(S^{x,n})\, \phi(S^{x,n})^T C^n}{\sigma_\varepsilon^2 + \phi(S^{x,n})^T C^n \phi(S^{x,n})},$$
resulting in a new induced prior on the value function, $V^{n+1}(S^x) = \phi(S^x)^T \theta^{n+1}$. Even if $C^0$ is a diagonal matrix, the updating equations will create empirical covariances.
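The two recursions above are a rank-one Kalman-style update and translate directly into code (a sketch with our own naming, following the slide's equations):

```python
import numpy as np

def kg_bayes_update(theta, C, phi, v_hat, sigma_eps2):
    """One Bayesian update of (theta, C) after observing v_hat
    at feature vector phi, per the slide's recursions."""
    Cphi = C @ phi                    # C^n phi(S^{x,n})
    denom = sigma_eps2 + phi @ Cphi   # sigma_eps^2 + phi^T C^n phi
    residual = v_hat - phi @ theta    # observation minus current prediction
    theta_new = theta + (residual / denom) * Cphi
    C_new = C - np.outer(Cphi, Cphi) / denom
    return theta_new, C_new
```

Note that `C_new` picks up off-diagonal entries even when `C` starts diagonal, which is exactly the empirical-covariance effect mentioned on the slide.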
The knowledge gradient policy
Previously studied in ranking and selection (Gupta & Miescke 1996) and multi-armed bandits (Ryzhov et al. 2010). It is a one-period look-ahead policy: how much will the next decision improve our expected objective value?
KG decision rule:
$$x^{KG,n} = \arg\max_x C(S^n, x) + \gamma\, \mathbb{E}^n_x V^{n+1}(S^{x,n}).$$
We look one step ahead: on average, what will the VFA become after the next decision?
Intuition behind the KG idea
Can we account for the change from $\theta^n$ to $\theta^{n+1}$ before we choose to go to $S^{x,n}$?
Derivation
KG decision rule:
$$x^{KG,n} = \arg\max_x C(S^n, x) + \gamma\, \mathbb{E}^n_x V^{n+1}(S^{x,n}).$$
We can expand
$$\mathbb{E}^n_x V^{n+1}(S^{x,n}) = \sum_{S^{n+1}} P(S^{n+1} \mid S^{x,n})\, \mathbb{E}^n_x \max_{x'} Q^{n+1}(S^{n+1}, x'),$$
where $Q^{n+1}(S^{n+1}, x') = C(S^{n+1}, x') + \gamma\, \phi(S^{x',n+1})^T \theta^{n+1}$.
Derivation
Bayesian analysis (Frazier et al. 2009) shows that the conditional distribution of $\theta^{n+1}$, given decision $x$, is
$$\theta^{n+1} \sim \theta^n + \frac{C^n \phi(S^{x,n})}{\sqrt{\sigma_\varepsilon^2 + \phi(S^{x,n})^T C^n \phi(S^{x,n})}}\, Z, \qquad Z \sim N(0,1).$$
This $Z$ is not indexed by $S$, yielding
$$\mathbb{E}^n_x \max_{x'} Q^{n+1}(S^{n+1}, x') = \mathbb{E} \max_{x'} \left(a^n_{x'} + b^n_{x'} Z\right).$$
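The expectation $\mathbb{E}\max_{x'}(a_{x'} + b_{x'} Z)$ can be computed exactly via the breakpoint algorithm of Frazier et al.; as a rough illustrative check only (not the authors' method), it can also be estimated by sampling $Z$:

```python
import numpy as np

def expected_max_affine(a, b, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[max_x (a_x + b_x * Z)], Z ~ N(0,1).
    A sampling-based stand-in for the exact computation of
    Frazier et al. (2009), used here only for illustration."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal(n_samples)
    # shape (n_actions, n_samples): row x holds a_x + b_x * Z over all samples
    vals = np.asarray(a)[:, None] + np.asarray(b)[:, None] * Z[None, :]
    return float(vals.max(axis=0).mean())
```

For instance, with $a = (0, 0)$ and $b = (1, -1)$ the expectation is $\mathbb{E}|Z| = \sqrt{2/\pi}$, which the estimate recovers to within sampling error.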
Final form of the knowledge gradient policy
This expectation can be computed, and the KG decision becomes
$$x^{KG,n} = \arg\max_x C(S^n, x) + \gamma\, \phi(S^{x,n})^T \theta^n + \gamma \sum_{S^{n+1}} P(S^{n+1} \mid S^{x,n})\, \nu^{KG,n}(S^{x,n}, S^{n+1}).$$
This is Bellman's equation plus a KG term representing the value of information. The transition probabilities are difficult to compute, but we can simulate $K$ transitions $S^{n+1}_k$ from $S^{x,n}$ and approximate
$$\sum_{S^{n+1}} P(S^{n+1} \mid S^{x,n})\, \nu^{KG,n}(S^{x,n}, S^{n+1}) \approx \frac{1}{K} \sum_{k=1}^K \nu^{KG,n}\!\left(S^{x,n}, S^{n+1}_k\right).$$
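The Monte Carlo approximation of the KG term is a plain sample average over simulated transitions (a sketch; `simulate_next` and `nu_kg` are hypothetical placeholders for the price-process simulator and the KG value of information):

```python
import numpy as np

def kg_term(Sx, simulate_next, nu_kg, K=100, seed=0):
    """Approximate sum_{S'} P(S' | Sx) * nu^{KG}(Sx, S')
    by averaging nu^{KG} over K simulated transitions from Sx.

    simulate_next(Sx, rng) -- draws one successor state S^{n+1} (placeholder)
    nu_kg(Sx, S1)          -- KG value of information for that transition (placeholder)
    """
    rng = np.random.default_rng(seed)
    samples = [simulate_next(Sx, rng) for _ in range(K)]
    return float(np.mean([nu_kg(Sx, S1) for S1 in samples]))
```

Larger `K` reduces the sampling error of the KG term at the cost of more calls to the price-process simulator per decision.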
Setup of experiments
- Two-dimensional state variable $S^n = (R^n, P^n)$: the price evolves according to a mean-reverting SDE; the state is continuous, but we discretize the VFA.
- Two performance measures. Online: how much revenue did we collect while training the VFA? Offline: how much revenue can we collect after the VFA is fixed?
- Two forms of KG. Online: maximize Bellman's equation plus the value of information. Offline: maximize the value of information only.
Numerical results
Table: online and offline values (averaged over 1000 sample paths) achieved by different methods in 150 iterations, reporting the mean and average standard error of the offline and online objectives. With the lookup VFA, the compared methods are offline KG, online KG, BQL, Bayesian epsilon-greedy, and ADP epsilon-greedy; with the basis-function VFA, they are offline KG, online KG, and basis epsilon-greedy. [The numerical values were not preserved in this transcription.]
Online performance of KG with basis functions
Early observations drastically change the VFA, leading to volatile online results. However, as the VFA improves, results get better quickly.
Policy produced by offline KG
After 150 iterations, the final VFA produces a "buy low, sell high" policy.
Implementation issues
We can obtain a very good fit, but we need to pick the problem parameters ($\sigma_\varepsilon^2$ and $C^0$) carefully. Some parameter values can cause the estimates $\theta^n$ to diverge. This is a property of basis functions, not of KG.
A comment on basis functions (Sutton et al. 2008): "Counterexamples have been known for many years... in which Q-learning's parameters diverge to infinity for any positive step size... the need [to handle large state spaces] is so great that practitioners have often simply ignored the problem and continued to use Q-learning with linear function approximation anyway."
Conclusions
- We have proposed a new policy for making decisions in ADP with parametric value function approximations.
- The policy uses a Bayesian belief structure to calculate the value of information obtained from a single decision.
- We use the knowledge gradient approach to balance exploration and exploitation in both online and offline settings.
- The policy produces very good fits, but requires tuning.
References
Dearden, R., Friedman, N. & Russell, S. (1998). Bayesian Q-learning. Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence.
Frazier, P.I., Powell, W. & Dayanik, S. (2009). The knowledge-gradient policy for correlated normal beliefs. INFORMS Journal on Computing 21(4).
Gupta, S. & Miescke, K. (1996). Bayesian look ahead one stage sampling allocation for selecting the best population. Journal of Statistical Planning and Inference 54.
Ryzhov, I.O., Powell, W. & Frazier, P.I. (2010). The knowledge gradient algorithm for a general class of online learning problems. Submitted to Operations Research.
Sutton, R., Szepesvári, C. & Maei, H. (2008). A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. Advances in Neural Information Processing Systems 21.
Van Roy, B., Bertsekas, D., Lee, Y. & Tsitsiklis, J. (1997). A neuro-dynamic programming approach to retailer inventory management. Proceedings of the 36th IEEE Conference on Decision and Control.
More informationDecision Theory: Q-Learning
Decision Theory: Q-Learning CPSC 322 Decision Theory 5 Textbook 12.5 Decision Theory: Q-Learning CPSC 322 Decision Theory 5, Slide 1 Lecture Overview 1 Recap 2 Asynchronous Value Iteration 3 Q-Learning
More informationGrundlagen der Künstlichen Intelligenz
Grundlagen der Künstlichen Intelligenz Reinforcement learning Daniel Hennes 4.12.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Reinforcement learning Model based and
More informationState Space Abstractions for Reinforcement Learning
State Space Abstractions for Reinforcement Learning Rowan McAllister and Thang Bui MLG RCC 6 November 24 / 24 Outline Introduction Markov Decision Process Reinforcement Learning State Abstraction 2 Abstraction
More informationReinforcement Learning
Reinforcement Learning Function approximation Mario Martin CS-UPC May 18, 2018 Mario Martin (CS-UPC) Reinforcement Learning May 18, 2018 / 65 Recap Algorithms: MonteCarlo methods for Policy Evaluation
More informationMARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti
1 MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti Historical background 2 Original motivation: animal learning Early
More informationTutorial: Stochastic Optimization in Energy
Tutorial: Stochastic Optimization in Energy FERC, Washington, D.C. August 6, 2014 Warren B. Powell CASTLE Labs Princeton University http://www.castlelab.princeton.edu Warren B. Powell, 2014 Slide 1 Mission
More informationLecture 2: Learning from Evaluative Feedback. or Bandit Problems
Lecture 2: Learning from Evaluative Feedback or Bandit Problems 1 Edward L. Thorndike (1874-1949) Puzzle Box 2 Learning by Trial-and-Error Law of Effect: Of several responses to the same situation, those
More informationQ-Learning and SARSA: Machine learningbased stochastic control approaches for financial trading
Q-Learning and SARSA: Machine learningbased stochastic control approaches for financial trading Marco CORAZZA (corazza@unive.it) Department of Economics Ca' Foscari University of Venice CONFERENCE ON COMPUTATIONAL
More informationOptimal Control. McGill COMP 765 Oct 3 rd, 2017
Optimal Control McGill COMP 765 Oct 3 rd, 2017 Classical Control Quiz Question 1: Can a PID controller be used to balance an inverted pendulum: A) That starts upright? B) That must be swung-up (perhaps
More informationLecture 10 - Planning under Uncertainty (III)
Lecture 10 - Planning under Uncertainty (III) Jesse Hoey School of Computer Science University of Waterloo March 27, 2018 Readings: Poole & Mackworth (2nd ed.)chapter 12.1,12.3-12.9 1/ 34 Reinforcement
More informationBiasing Approximate Dynamic Programming with a Lower Discount Factor
Biasing Approximate Dynamic Programming with a Lower Discount Factor Marek Petrik, Bruno Scherrer To cite this version: Marek Petrik, Bruno Scherrer. Biasing Approximate Dynamic Programming with a Lower
More informationReinforcement Learning in Partially Observable Multiagent Settings: Monte Carlo Exploring Policies
Reinforcement earning in Partially Observable Multiagent Settings: Monte Carlo Exploring Policies Presenter: Roi Ceren THINC ab, University of Georgia roi@ceren.net Prashant Doshi THINC ab, University
More informationParameter Estimation in the Spatio-Temporal Mixed Effects Model Analysis of Massive Spatio-Temporal Data Sets
Parameter Estimation in the Spatio-Temporal Mixed Effects Model Analysis of Massive Spatio-Temporal Data Sets Matthias Katzfuß Advisor: Dr. Noel Cressie Department of Statistics The Ohio State University
More informationSymbolic Perseus: a Generic POMDP Algorithm with Application to Dynamic Pricing with Demand Learning
Symbolic Perseus: a Generic POMDP Algorithm with Application to Dynamic Pricing with Demand Learning Pascal Poupart (University of Waterloo) INFORMS 2009 1 Outline Dynamic Pricing as a POMDP Symbolic Perseus
More informationIntroduction to Reinforcement Learning
Introduction to Reinforcement Learning Rémi Munos SequeL project: Sequential Learning http://researchers.lille.inria.fr/ munos/ INRIA Lille - Nord Europe Machine Learning Summer School, September 2011,
More information