Bayesian Active Learning With Basis Functions
Ilya O. Ryzhov and Warren B. Powell
Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA
IEEE ADPRL, April 13, 2011
Outline
1. Introduction
2. Bayesian model with correlated beliefs
3. The knowledge gradient policy for exploration
4. Numerical results: storage example
5. Conclusions
Motivation: energy storage
- We buy energy from the spot market when the price is low and sell when the price is high.
- We store energy in a battery, which can be partially charged or discharged at will.
- The price process is continuous and highly volatile (e.g. a mean-reverting SDE).
Motivation: energy storage
- The state of the problem at time $n$ is $S^n = (R^n, P^n)$, where
  - $R^n$ is the resource level, the amount of energy currently stored
  - $P^n$ is the current spot price of energy
- We make a decision $x^n$ representing how much to charge ($x^n \geq 0$) or discharge ($x^n < 0$).
- The single-period revenue is $C(S^n, x^n) = -P^n x^n$.
- We are maximizing total discounted revenue over a long time horizon.
Approximate dynamic programming
- We cannot solve Bellman's equation
  $V(S) = \max_x \left( C(S,x) + \gamma \mathbb{E}\left[ V(S') \mid S \right] \right)$
  for all states, because the price process is continuous.
- Instead, we compute an approximate observation
  $\hat{v}^n = \max_x \left( C(S^n,x) + \gamma V^{n-1}\!\left( S^{M,x}(S^n,x) \right) \right)$
  where $S^{M,x}$ maps $S^n$ and $x$ to the post-decision state $S^{x,n}$ (Van Roy et al. 1997).
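As an illustration (not from the original slides), a minimal Python sketch of how this observation could be computed over a discretized action set; `contribution`, `post_decision`, and `vfa` are hypothetical placeholder names for the problem-specific pieces.

```python
import numpy as np

def adp_observation(S, actions, contribution, post_decision, vfa, gamma=0.99):
    """Compute v_hat = max_x [ C(S, x) + gamma * V^{n-1}(S^{M,x}(S, x)) ].

    `contribution`, `post_decision`, and `vfa` are problem-specific callables
    supplied by the user (hypothetical names, not defined in the slides).
    """
    values = [contribution(S, x) + gamma * vfa(post_decision(S, x)) for x in actions]
    best = int(np.argmax(values))
    return values[best], actions[best]
```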
The post-decision state
- [Diagram: the decision $x^n$ takes $S^n$ to the post-decision state $S^{x,n}$; the exogenous transition $P(S^{n+1} \mid S^{x,n})$ then takes $S^{x,n}$ to $S^{n+1}$, and so on to $S^{x,n+1}$.]
- In our simple storage problem, the post-decision state is given by
  $R^{x,n} = R^n + x^n$, $\quad P^{x,n} = P^n$.
- The transition from $P^{x,n}$ to $P^{n+1}$ comes from the price process.
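A small sketch (mine, not from the slides) of the storage problem's transition structure; the revenue sign convention follows the earlier slide, and the exogenous price draw is left abstract.

```python
def post_decision_state(S, x):
    """Post-decision state: R^{x,n} = R^n + x^n, P^{x,n} = P^n."""
    R, P = S
    return (R + x, P)

def next_pre_decision_state(S_post, next_price):
    """Pre-decision state S^{n+1}: the resource level carries over from the
    post-decision state, while the price moves according to the price process."""
    R_post, _ = S_post
    return (R_post, next_price)

def revenue(S, x):
    """Single-period revenue C(S, x) = -P * x (pay when charging, earn when discharging)."""
    _, P = S
    return -P * x
```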
The exploration/exploitation dilemma
- Exploitation: the decision
  $x^n = \arg\max_x \left( C(S^n,x) + \gamma V^{n-1}(S^{x,n}) \right)$
  seems to be the best under the current VFA.
- Exploration: because the VFA is inaccurate, we might make a different decision, in the hope of discovering a better strategy.
- [Diagram: two-state example with rewards 1, 0, and +20.]
- Example: if we start in state 0 with $V^{n-1} = 0$ for both states, a pure exploitation strategy leads us to stay in state 0 forever.
Basis functions
- We use a parametric approximation
  $V(S^x) = \sum_{i=1}^F \theta_i \phi_i(S^x)$
  where the $\phi_i$ are basis functions of the post-decision state:
  $\phi(S^x) = \left( 1,\, R^x,\, (R^x)^2,\, P^x,\, (P^x)^2,\, R^x P^x \right)^T$.
- The problem is reduced; we only need to fit a finite set of parameters.
- A single observation now changes our beliefs about the entire value surface.
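A minimal sketch (not from the slides) of the feature vector and the resulting VFA, assuming the post-decision state is a (resource, price) pair as above.

```python
import numpy as np

def basis(S_post):
    """phi(S^x) = (1, R^x, (R^x)^2, P^x, (P^x)^2, R^x * P^x)^T."""
    R, P = S_post
    return np.array([1.0, R, R**2, P, P**2, R * P])

def vfa(S_post, theta):
    """Parametric approximation V(S^x) = theta^T phi(S^x)."""
    return float(basis(S_post) @ theta)
```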
Correlated beliefs using multivariate normal priors
- Correlations are modeled using a covariance matrix: we assume that $\theta \sim N(\theta^0, C^0)$.
- This induces a prior on the value function:
  $\mathbb{E}\, V(S^x) = (\theta^0)^T \phi(S^x)$
  $\mathrm{Cov}\left( V(S^x), V(S^{x'}) \right) = \phi(S^x)^T C^0 \phi(S^{x'})$
- The quantity $V^0(S^x) = \mathbb{E}\, V(S^x)$ represents our initial estimate of the value of $S^x$.
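For illustration only (reusing the hypothetical `basis` helper sketched earlier): how the normal prior on $\theta$ induces a prior mean and covariance for the values of any pair of post-decision states.

```python
import numpy as np

def induced_prior(S1, S2, theta0, C0):
    """Prior mean E V(S^x) = (theta^0)^T phi(S^x) and prior covariance
    Cov(V(S^x), V(S^x')) = phi(S^x)^T C^0 phi(S^x')."""
    phi1, phi2 = basis(S1), basis(S2)
    mean1 = float(theta0 @ phi1)
    cov12 = float(phi1 @ C0 @ phi2)
    return mean1, cov12
```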
Main assumption
- Assumption 1 (Dearden et al. 1998): the ADP observation $\hat{v}^{n+1}$ follows the distribution $N\left( V(S^{x,n}), \sigma_\varepsilon^2 \right)$ and is independent of past observations.
- Standard assumption in optimal learning (e.g. bandit problems), but does not hold in ADP.
- The observation $\hat{v}^{n+1}$ is biased by definition:
  $\hat{v}^{n+1} = \max_x \left( C(S^{n+1},x) + \gamma V^n\!\left( S^{M,x}(S^{n+1},x) \right) \right)$
Learning with correlated beliefs
- Once we choose action $x$, Assumption 1 allows us to update
  $\theta^{n+1} = \theta^n + \dfrac{\hat{v}^{n+1} - \phi(S^{x,n})^T \theta^n}{\sigma_\varepsilon^2 + \phi(S^{x,n})^T C^n \phi(S^{x,n})}\, C^n \phi(S^{x,n})$
  $C^{n+1} = C^n - \dfrac{C^n \phi(S^{x,n}) \phi(S^{x,n})^T C^n}{\sigma_\varepsilon^2 + \phi(S^{x,n})^T C^n \phi(S^{x,n})}$
  resulting in a new induced prior on the value function:
  $V^{n+1}(S^x) = \phi(S^x)^T \theta^{n+1}$
- Even if $C^0$ is a diagonal matrix, the updating equations will create empirical covariances.
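A minimal sketch of the update above (my code, not the authors'); under Assumption 1 it is the standard recursive Bayesian linear-regression update.

```python
import numpy as np

def bayes_update(theta, C, phi, v_hat, sigma_eps2):
    """Update (theta^n, C^n) -> (theta^{n+1}, C^{n+1}) after observing v_hat
    at a post-decision state with feature vector phi = phi(S^{x,n})."""
    Cphi = C @ phi
    denom = sigma_eps2 + phi @ Cphi          # sigma_eps^2 + phi^T C^n phi
    theta_new = theta + ((v_hat - phi @ theta) / denom) * Cphi
    C_new = C - np.outer(Cphi, Cphi) / denom
    return theta_new, C_new
```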
The knowledge gradient policy
- Previously studied in ranking and selection (Gupta & Miescke 1996) and multi-armed bandits (Ryzhov et al. 2010).
- One-period look-ahead policy: how much will the next decision improve our expected objective value?
- KG decision rule:
  $x^{KG,n} = \arg\max_x \left( C(S^n,x) + \gamma\, \mathbb{E}^n_x V^{n+1}(S^{x,n}) \right)$
- We look one step ahead: on average, what will the VFA become after the next decision?
Intuition behind the KG idea
- [Animation frames: starting from $S^n$ with contribution $C(S^n,x)$, an observation $\hat{v}^{n+1}$ arrives and the belief changes from $\theta^n$ to $\theta^{n+1}$.]
- Can we account for the change from $\theta^n$ to $\theta^{n+1}$ before we choose to go to $S^{x,n}$?
Derivation
- KG decision rule:
  $x^{KG,n} = \arg\max_x \left( C(S^n,x) + \gamma\, \mathbb{E}^n_x V^{n+1}(S^{x,n}) \right)$
- We can expand
  $\mathbb{E}^n_x V^{n+1}(S^{x,n}) = \sum_{S^{n+1}} P\left( S^{n+1} \mid S^{x,n} \right) \mathbb{E}^n_x \max_{x'} Q^{n+1}\left( S^{n+1}, x' \right)$
  where
  $Q^{n+1}\left( S^{n+1}, x' \right) = C\left( S^{n+1}, x' \right) + \gamma\, \phi\left( S^{x',n+1} \right)^T \theta^{n+1}$
Derivation
- Bayesian analysis (Frazier et al. 2009) shows that the conditional distribution of $\theta^{n+1}$, given decision $x$, is
  $\theta^{n+1} \sim \theta^n + \dfrac{C^n \phi(S^{x,n})}{\sqrt{\sigma_\varepsilon^2 + \phi(S^{x,n})^T C^n \phi(S^{x,n})}}\, Z$
  where $Z \sim N(0,1)$.
- This $Z$ is not indexed by $S$, yielding
  $\mathbb{E}^n_x \max_{x'} Q^{n+1}\left( S^{n+1}, x' \right) = \mathbb{E} \max_{x'} \left( a^n_{x'} + b^n_{x'} Z \right)$
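The slides evaluate $\mathbb{E} \max_{x'}(a^n_{x'} + b^n_{x'} Z)$ exactly using the procedure of Frazier et al. (2009); as a rough stand-in for illustration (mine, with arbitrary defaults), a Monte Carlo estimate of the same expectation:

```python
import numpy as np

def expected_max_affine(a, b, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E[ max_x (a_x + b_x * Z) ] with Z ~ N(0,1).

    The exact (non-simulation) computation of Frazier et al. (2009) is what the
    slides rely on; this sampling version is only an illustrative substitute.
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal(n_samples)
    return float(np.max(a[:, None] + b[:, None] * Z[None, :], axis=0).mean())
```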
Final form of the knowledge gradient policy
- This expectation can be computed, and the KG decision becomes
  $x^{KG,n} = \arg\max_x \left( C(S^n,x) + \gamma\, \phi(S^{x,n})^T \theta^n + \gamma \sum_{S^{n+1}} P\left( S^{n+1} \mid S^{x,n} \right) \nu^{KG,n}\left( S^{x,n}, S^{n+1} \right) \right)$
- This is Bellman's equation plus a KG term representing the value of information.
- The transition probabilities are difficult to compute, but we can simulate $K$ transitions from $S^{x,n}$ to $S^{n+1}$ and approximate
  $\sum_{S^{n+1}} P\left( S^{n+1} \mid S^{x,n} \right) \nu^{KG,n}\left( S^{x,n}, S^{n+1} \right) \approx \dfrac{1}{K} \sum_{k=1}^K \nu^{KG,n}\left( S^{x,n}, S^{n+1}_k \right)$
  where $S^{n+1}_k$ is the $k$th simulated transition.
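A sketch of this simulation-based approximation (hypothetical helper names: `simulate_transition` draws $S^{n+1}$ from $P(\cdot \mid S^{x,n})$, and `nu_kg` evaluates $\nu^{KG,n}$).

```python
import numpy as np

def kg_value_of_information(S_post, simulate_transition, nu_kg, K=20, seed=0):
    """Approximate sum_{S'} P(S' | S^{x,n}) * nu^{KG,n}(S^{x,n}, S') by averaging
    the KG term over K simulated transitions from the post-decision state."""
    rng = np.random.default_rng(seed)
    samples = [nu_kg(S_post, simulate_transition(S_post, rng)) for _ in range(K)]
    return float(np.mean(samples))
```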
Setup of experiments
- Two-dimensional state variable: $S^n = (R^n, P^n)$.
- The price variable evolves according to a mean-reverting SDE (see the sketch after this slide).
- The state is continuous, but we discretize the VFA.
- Two performance measures:
  - Online: how much revenue did we collect while training the VFA?
  - Offline: how much revenue can we collect after the VFA is fixed?
- Two forms of KG:
  - Online: maximize Bellman's equation plus the value of information.
  - Offline: maximize the value of information only.
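The slides only state that the price is mean-reverting; as an illustration, an Euler discretization of an Ornstein-Uhlenbeck process with made-up parameter values (not those used in the experiments).

```python
import numpy as np

def simulate_prices(P0=30.0, kappa=0.5, mu=30.0, sigma=5.0, dt=1.0, n_steps=150, seed=0):
    """Euler scheme for dP = kappa*(mu - P) dt + sigma dW.
    All parameter values here are illustrative placeholders."""
    rng = np.random.default_rng(seed)
    P = np.empty(n_steps + 1)
    P[0] = P0
    for t in range(n_steps):
        P[t + 1] = P[t] + kappa * (mu - P[t]) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    return P
```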
Numerical results

                         Offline objective        Online objective
                         Mean        Avg. SE      Mean        Avg. SE
  Lookup VFA
    Offline KG            210.43     0.33         -277.37     15.90
    Online KG              79.36     0.23          160.28      5.43
    BQL                   133.65     0.31           76.03      1.87
    Bayes EG               85.65     0.31           67.10      3.00
    ADP EG                154.47     0.35            7.09      2.57
  Basis VFA
    Offline KG           1136.20     3.54         -342.23     19.96
    Online KG             871.1285   3.15           44.58     27.71
    Basis EG             -475.54     2.30         -329.03     25.31

Table: Online and offline values (averaged over 1000 sample paths) achieved by different methods in 150 iterations.
Online performance of KG with basis functions
- Early observations drastically change the VFA, leading to volatile online results.
- However, as the VFA improves, results get better quickly.
Policy produced by offline KG
- After 150 iterations, the final VFA produces a "buy low, sell high" policy.
Implementation issues
- We can obtain a very good fit, but we need to pick the problem parameters ($\sigma_\varepsilon^2$ and $C^0$) carefully.
- Some parameter values can cause the estimates $\theta^n$ to diverge.
- This is a property of basis functions, not of KG.
- A comment on basis functions (Sutton et al. 2008): "Counterexamples have been known for many years ... in which Q-learning's parameters diverge to infinity for any positive step size ... the need [to handle large state spaces] is so great that practitioners have often simply ignored the problem and continued to use Q-learning with linear function approximation anyway."
Conclusions
- We have proposed a new policy for making decisions in ADP with parametric value function approximations.
- The policy uses a Bayesian belief structure to calculate the value of information obtained from a single decision.
- We use the knowledge gradient approach to balance exploration and exploitation in both online and offline settings.
- The policy produces very good fits, but requires tuning.
References

Dearden, R., Friedman, N. & Russell, S. (1998). Bayesian Q-learning. Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pp. 761-768.

Frazier, P.I., Powell, W. & Dayanik, S. (2009). The knowledge gradient policy for correlated normal rewards. INFORMS Journal on Computing 21(4), 599-613.

Gupta, S. & Miescke, K. (1996). Bayesian look ahead one stage sampling allocation for selecting the best population. Journal of Statistical Planning and Inference 54, 229-244.

Ryzhov, I.O., Powell, W. & Frazier, P.I. (2010). The knowledge gradient algorithm for a general class of online learning problems. Submitted to Operations Research.

Sutton, R., Szepesvári, C. & Maei, H. (2008). A convergent O(n) algorithm for off-policy temporal difference learning with linear function approximation. Advances in Neural Information Processing Systems 21, 1609-1616.

Van Roy, B., Bertsekas, D., Lee, Y. & Tsitsiklis, J. (1997). A neuro-dynamic programming approach to retailer inventory management. Proceedings of the 36th IEEE Conference on Decision and Control, pp. 4052-4057.