Bayesian Active Learning With Basis Functions

Bayesian Active Learning With Basis Functions

Ilya O. Ryzhov and Warren B. Powell
Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA
IEEE ADPRL, April 13, 2011

Outline
1. Introduction
2. Bayesian model with correlated beliefs
3. The knowledge gradient policy for exploration
4. Numerical results: storage example
5. Conclusions

Motivation: energy storage
- We buy energy from the spot market when the price is low and sell when the price is high.
- We store energy in a battery, which can be partially charged or discharged at will.
- The price process is continuous and highly volatile (e.g. a mean-reverting SDE).

Motivation: energy storage
The state of the problem at time n is $S^n = (R^n, P^n)$, where
- $R^n$ is the resource level, the amount of energy currently stored,
- $P^n$ is the current spot price of energy.
We make a decision $x^n$ representing how much to charge ($x^n \geq 0$) or discharge ($x^n < 0$). The single-period revenue is $C(S^n, x^n) = -P^n x^n$, and we maximize total discounted revenue over a large time horizon.

Approximate dynamic programming
We cannot solve Bellman's equation
$$V(S) = \max_x \left\{ C(S,x) + \gamma \, \mathbb{E}\left[ V(S') \mid S \right] \right\}$$
for all states, because the price process is continuous. Instead, we compute an approximate observation
$$\hat{v}^n = \max_x \left\{ C(S^n,x) + \gamma \, \bar{V}^{n-1}\left( S^{M,x}(S^n,x) \right) \right\},$$
where $S^{M,x}$ maps $S^n$ and $x$ to the post-decision state $S^{x,n}$ (Van Roy et al. 1997).
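As a rough illustration of this step, here is a minimal Python sketch of the approximate observation, assuming a discretized action set and generic callables C, S_M_x, and V_bar (the names are illustrative, not from the talk):

```python
def adp_observation(S, actions, C, S_M_x, V_bar, gamma=0.99):
    """Approximate Bellman observation:
    v_hat = max_x { C(S, x) + gamma * V_bar(S_M_x(S, x)) }."""
    return max(C(S, x) + gamma * V_bar(S_M_x(S, x)) for x in actions)
```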

The post-decision state
[Diagram: $S^n \xrightarrow{\;x^n\;} S^{x,n} \xrightarrow{\;P(S^{n+1} \mid S^{x,n})\;} S^{n+1} \rightarrow S^{x,n+1} \rightarrow \cdots$]
In our simple storage problem, the post-decision state is given by
$$R^{x,n} = R^n + x^n, \qquad P^{x,n} = P^n.$$
The transition from $P^{x,n}$ to $P^{n+1}$ comes from the price process.
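The storage dynamics can be sketched as follows; the mean-reverting price step and its parameters are illustrative assumptions, not values from the talk:

```python
import numpy as np

def post_decision(S, x):
    """Post-decision state: the resource level absorbs the decision,
    the price stays at its pre-decision value (R^x = R + x, P^x = P)."""
    R, P = S
    return (R + x, P)

def price_step(P, kappa=0.5, mu=30.0, sigma=5.0, rng=np.random):
    """Illustrative mean-reverting price update (Euler step of an OU-type SDE)."""
    return P + kappa * (mu - P) + sigma * rng.normal()

def next_state(S_x, rng=np.random):
    """Exogenous transition from the post-decision state S^{x,n} to S^{n+1}."""
    R_x, P_x = S_x
    return (R_x, price_step(P_x, rng=rng))
```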

The exploration/exploitation dilemma
Exploitation: the decision
$$x^n = \arg\max_x \left\{ C(S^n,x) + \gamma \, \bar{V}^{n-1}(S^{x,n}) \right\}$$
seems to be the best under the current VFA.
Exploration: because the VFA is inaccurate, we might make a different decision, in the hope of discovering a better strategy.
[Figure: two-state example with rewards 1, 0 and +20]
Example: if we start in state 0 with $\bar{V}^{n-1} = 0$ for both states, a pure exploitation strategy leads us to stay in state 0 forever.

Basis functions
We use a parametric approximation
$$\bar{V}(S^x) = \sum_{i=1}^{F} \theta_i \phi_i(S^x),$$
where the $\phi_i$ are basis functions of the post-decision state:
$$\phi(S^x) = \left( 1,\; R^x,\; (R^x)^2,\; P^x,\; (P^x)^2,\; R^x P^x \right)^T.$$
The problem is reduced: we only need to fit a finite set of parameters. A single observation now changes our beliefs about the entire value surface.
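A minimal sketch of this approximation in Python (function names are mine, chosen for illustration):

```python
import numpy as np

def phi(S_x):
    """Basis functions of the post-decision state S^x = (R^x, P^x)."""
    R, P = S_x
    return np.array([1.0, R, R**2, P, P**2, R * P])

def V_bar(S_x, theta):
    """Parametric VFA: V(S^x) = sum_i theta_i * phi_i(S^x) = theta^T phi(S^x)."""
    return phi(S_x) @ theta
```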

Correlated beliefs using multivariate normal priors
Correlations are modeled using a covariance matrix: we assume that $\theta \sim N\left(\theta^0, C^0\right)$. This induces a prior on the value function:
$$\mathbb{E}\, V(S^x) = \left(\theta^0\right)^T \phi(S^x), \qquad \mathrm{Cov}\left( V(S^x), V(S^{x'}) \right) = \phi(S^x)^T C^0 \phi(S^{x'}).$$
The quantity $\bar{V}^0(S^x) = \mathbb{E}\, V(S^x)$ represents our initial estimate of the value of $S^x$.
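In code, the induced prior on the value surface amounts to two quadratic forms in the feature vectors; a small sketch, where phi_x and phi_xp are feature vectors such as those returned by the phi above:

```python
import numpy as np

def prior_mean(phi_x, theta0):
    """E V(S^x) = (theta^0)^T phi(S^x)."""
    return phi_x @ theta0

def prior_cov(phi_x, phi_xp, C0):
    """Cov(V(S^x), V(S^x')) = phi(S^x)^T C^0 phi(S^x')."""
    return phi_x @ C0 @ phi_xp
```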

Main assumption
Assumption 1 (Dearden et al. 1998). The ADP observation $\hat{v}^{n+1}$ follows the distribution $N\left( V(S^{x,n}), \sigma_\varepsilon^2 \right)$ and is independent of past observations.
This is a standard assumption in optimal learning (e.g. bandit problems), but it does not hold in ADP. The observation $\hat{v}^{n+1}$ is biased by definition:
$$\hat{v}^{n+1} = \max_x \left\{ C\left(S^{n+1},x\right) + \gamma \, \bar{V}^n\left( S^{M,x}\left(S^{n+1},x\right) \right) \right\}.$$

Learning with correlated beliefs
Once we choose action x, Assumption 1 allows us to update
$$\theta^{n+1} = \theta^n + \frac{\hat{v}^{n+1} - \phi(S^{x,n})^T \theta^n}{\sigma_\varepsilon^2 + \phi(S^{x,n})^T C^n \phi(S^{x,n})} \, C^n \phi(S^{x,n}),$$
$$C^{n+1} = C^n - \frac{C^n \phi(S^{x,n}) \, \phi(S^{x,n})^T C^n}{\sigma_\varepsilon^2 + \phi(S^{x,n})^T C^n \phi(S^{x,n})},$$
resulting in a new induced prior on the value function, $\bar{V}^{n+1}(S^x) = \phi(S^x)^T \theta^{n+1}$. Even if $C^0$ is a diagonal matrix, the updating equations will create empirical covariances.
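A sketch of this recursive update (a standard Bayesian linear-regression step; variable names are mine):

```python
import numpy as np

def bayes_update(theta, C, phi_x, v_hat, sigma_eps2):
    """Update (theta^n, C^n) -> (theta^{n+1}, C^{n+1}) after observing v_hat
    at the post-decision state with feature vector phi_x = phi(S^{x,n})."""
    Cphi = C @ phi_x
    denom = sigma_eps2 + phi_x @ Cphi              # sigma_eps^2 + phi^T C^n phi
    theta_new = theta + ((v_hat - phi_x @ theta) / denom) * Cphi
    C_new = C - np.outer(Cphi, Cphi) / denom
    return theta_new, C_new

# Example: even if C^0 is diagonal, one update already creates off-diagonal
# (empirical) covariances in C^1.
theta0, C0 = np.zeros(6), np.eye(6)
phi_x = np.array([1.0, 2.0, 4.0, 30.0, 900.0, 60.0])   # phi((R, P)) at R = 2, P = 30
theta1, C1 = bayes_update(theta0, C0, phi_x, v_hat=25.0, sigma_eps2=1.0)
```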

The knowledge gradient policy
Previously studied in ranking and selection (Gupta & Miescke 1996) and multi-armed bandits (Ryzhov et al. 2010). It is a one-period look-ahead policy: how much will the next decision improve our expected objective value?
KG decision rule:
$$x^{KG,n} = \arg\max_x \left\{ C(S^n,x) + \gamma \, \mathbb{E}^n_x \bar{V}^{n+1}(S^{x,n}) \right\}$$
We look one step ahead: on average, what will the VFA become after the next decision?

Intuition behind KG idea
[Figure: animated build showing the trajectory from $S^n$, the immediate reward $C(S^n,x)$, the observation $\hat{v}^{n+1}$, and the resulting change in the estimate from $\theta^n$ to $\theta^{n+1}$]
Can we account for the change from $\theta^n$ to $\theta^{n+1}$ before we choose to go to $S^{x,n}$?

Derivation
Recall the KG decision rule:
$$x^{KG,n} = \arg\max_x \left\{ C(S^n,x) + \gamma \, \mathbb{E}^n_x \bar{V}^{n+1}(S^{x,n}) \right\}$$
We can expand
$$\mathbb{E}^n_x \bar{V}^{n+1}(S^{x,n}) = \sum_{S^{n+1}} P\left( S^{n+1} \mid S^{x,n} \right) \mathbb{E}^n_x \max_{x'} Q^{n+1}\left( S^{n+1}, x' \right),$$
where
$$Q^{n+1}\left( S^{n+1}, x' \right) = C\left( S^{n+1}, x' \right) + \gamma \, \phi\left( S^{x',n+1} \right)^T \theta^{n+1}.$$

Derivation
Bayesian analysis (Frazier et al. 2009) shows that the conditional distribution of $\theta^{n+1}$, given decision x, is
$$\theta^{n+1} \sim \theta^n + \frac{C^n \phi(S^{x,n})}{\sqrt{\sigma_\varepsilon^2 + \phi(S^{x,n})^T C^n \phi(S^{x,n})}} \, Z,$$
where $Z \sim N(0,1)$. This Z is not indexed by S, yielding
$$\mathbb{E}^n_x \max_{x'} Q^{n+1}\left( S^{n+1}, x' \right) = \mathbb{E} \max_{x'} \left( a^n_{x'} + b^n_{x'} Z \right).$$
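The expectation of a maximum of affine functions of a scalar Gaussian can be computed in closed form (Frazier et al. 2009). The sketch below is one common way to implement that computation, via an upper-envelope construction; it is my own illustration rather than the authors' code:

```python
import numpy as np
from scipy.stats import norm

def expected_max_affine(a, b):
    """Compute E[ max_x (a_x + b_x * Z) ] for Z ~ N(0, 1)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    # For equal slopes, only the largest intercept can ever be the maximum.
    slopes, idx = np.unique(b, return_inverse=True)
    intercepts = np.full(len(slopes), -np.inf)
    np.maximum.at(intercepts, idx, a)
    a, b = intercepts, slopes                      # slopes now strictly increasing

    keep, cuts = [0], [-np.inf]                    # envelope lines and their left breakpoints
    for i in range(1, len(a)):
        while True:
            j = keep[-1]
            c = (a[j] - a[i]) / (b[i] - b[j])      # Z where line i overtakes line j
            if c <= cuts[-1]:
                keep.pop(); cuts.pop()             # line j is never the maximum
            else:
                keep.append(i); cuts.append(c)
                break
    cuts.append(np.inf)

    # Integrate the envelope piece by piece against the standard normal density:
    # E[(a_j + b_j Z) 1{lo < Z < hi}] = a_j (Phi(hi) - Phi(lo)) + b_j (phi(lo) - phi(hi)).
    total = 0.0
    for k, j in enumerate(keep):
        lo, hi = cuts[k], cuts[k + 1]
        total += a[j] * (norm.cdf(hi) - norm.cdf(lo)) + b[j] * (norm.pdf(lo) - norm.pdf(hi))
    return total
```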

Final form of knowledge gradient policy
This expectation can be computed, and the KG decision becomes
$$x^{KG,n} = \arg\max_x \left\{ C(S^n,x) + \gamma \, \phi(S^{x,n})^T \theta^n + \gamma \sum_{S^{n+1}} P\left( S^{n+1} \mid S^{x,n} \right) \nu^{KG,n}\left( S^{x,n}, S^{n+1} \right) \right\}.$$
This is Bellman's equation plus a KG term representing the value of information. The transition probabilities are difficult to compute, but we can simulate K transitions from $S^{x,n}$ to $S^{n+1}$ and approximate
$$\sum_{S^{n+1}} P\left( S^{n+1} \mid S^{x,n} \right) \nu^{KG,n}\left( S^{x,n}, S^{n+1} \right) \approx \frac{1}{K} \sum_{k=1}^{K} \nu^{KG,n}\left( S^{x,n}, S^{n+1}_k \right).$$
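Putting the pieces together, here is a hedged sketch of the resulting decision rule, using the Monte Carlo approximation above and the expected_max_affine routine from the previous snippet. The KG term is taken here as the expected improvement $\mathbb{E}\max_{x'}(a_{x'} + b_{x'}Z) - \max_{x'} a_{x'}$, which is one standard form of the knowledge gradient and an assumption on my part; the helpers C, post_decision, sample_next_state, and phi are placeholders for the problem-specific pieces sketched earlier:

```python
import numpy as np

def kg_decision(S, actions, C, post_decision, sample_next_state, phi,
                theta, Cov, sigma_eps2, gamma=0.99, K=20, rng=np.random):
    """Choose x maximizing C(S,x) + gamma*phi(S^x)^T theta + gamma*(KG term),
    with the KG term estimated from K simulated transitions S^{x,n} -> S^{n+1}."""
    best_x, best_val = None, -np.inf
    for x in actions:
        S_x = post_decision(S, x)
        phi_x = phi(S_x)
        bellman = C(S, x) + gamma * (phi_x @ theta)
        scale = np.sqrt(sigma_eps2 + phi_x @ Cov @ phi_x)
        kg = 0.0
        for _ in range(K):
            S_next = sample_next_state(S_x, rng)
            feats = [phi(post_decision(S_next, xp)) for xp in actions]
            a = np.array([C(S_next, xp) + gamma * (f @ theta)
                          for xp, f in zip(actions, feats)])
            b = np.array([gamma * (f @ (Cov @ phi_x)) / scale for f in feats])
            kg += expected_max_affine(a, b) - a.max()   # value of information at S^{n+1}
        val = bellman + gamma * kg / K
        if val > best_val:
            best_x, best_val = x, val
    return best_x
```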

Setup of experiments
- Two-dimensional state variable $S^n = (R^n, P^n)$; the price variable evolves according to a mean-reverting SDE. The state is continuous, but we discretize the VFA.
- Two performance measures:
  - Online: how much revenue did we collect while training the VFA?
  - Offline: how much revenue can we collect after the VFA is fixed?
- Two forms of KG:
  - Online: maximize Bellman's equation plus the value of information.
  - Offline: maximize the value of information only.

Numerical results

                           Offline objective        Online objective
                           Mean        Avg. SE      Mean        Avg. SE
Lookup VFA   Offline KG     210.43      0.33        -277.37     15.90
             Online KG       79.36      0.23         160.28      5.43
             BQL            133.65      0.31          76.03      1.87
             Bayes EG        85.65      0.31          67.10      3.00
             ADP EG         154.47      0.35           7.09      2.57
Basis VFA    Offline KG    1136.20      3.54        -342.23     19.96
             Online KG      871.1285    3.15          44.58     27.71
             Basis EG      -475.54      2.30        -329.03     25.31

Table: Online and offline values (averaged over 1000 sample paths) achieved by different methods in 150 iterations.

Online performance of KG with basis functions
Early observations drastically change the VFA, leading to volatile online results. However, as the VFA improves, results get better quickly.

Policy produced by offline KG
After 150 iterations, the final VFA produces a buy-low, sell-high policy.

Implementation issues
We can obtain a very good fit, but we need to pick the problem parameters ($\sigma_\varepsilon^2$ and $C^0$) carefully. Some parameter values can cause the estimates $\theta^n$ to diverge. This is a property of basis functions, not of KG.
A comment on basis functions (Sutton et al. 2008): "Counterexamples have been known for many years ... in which Q-learning's parameters diverge to infinity for any positive step size ... the need [to handle large state spaces] is so great that practitioners have often simply ignored the problem and continued to use Q-learning with linear function approximation anyway."

Conclusions
- We have proposed a new policy for making decisions in ADP with parametric value function approximations.
- The policy uses a Bayesian belief structure to calculate the value of information obtained from a single decision.
- We use the knowledge gradient approach to balance exploration and exploitation in both online and offline settings.
- The policy produces very good fits, but requires tuning.

References
Dearden, R., Friedman, N. & Russell, S. (1998) Bayesian Q-learning. Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pp. 761-768.
Frazier, P.I., Powell, W. & Dayanik, S. (2009) The knowledge gradient policy for correlated normal rewards. INFORMS J. on Computing 21(4), 599-613.
Gupta, S. & Miescke, K. (1996) Bayesian look ahead one stage sampling allocation for selecting the best population. J. of Statistical Planning and Inference 54, 229-244.

Ryzhov, I.O., Powell, W. & Frazier, P.I. (2010) The knowledge gradient algorithm for a general class of online learning problems. Submitted to Operations Research.
Sutton, R., Szepesvári, C. & Maei, H. (2008) A convergent O(n) algorithm for off-policy temporal difference learning with linear function approximation. Advances in Neural Information Processing Systems 21, 1609-1616.
Van Roy, B., Bertsekas, D., Lee, Y. & Tsitsiklis, J. (1997) A neuro-dynamic programming approach to retailer inventory management. Proceedings of the 36th IEEE Conference on Decision and Control, pp. 4052-4057.