Bayesian Active Learning With Basis Functions
Ilya O. Ryzhov and Warren B. Powell
Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA
IEEE ADPRL, April 13, 2011
Outline
1. Introduction
2. Bayesian model with correlated beliefs
3. The knowledge gradient policy for exploration
4. Numerical results: storage example
5. Conclusions
Motivation: energy storage
- We buy energy from the spot market when the price is low and sell when the price is high.
- We store energy in a battery, which can be partially charged or discharged at will.
- The price process is continuous and highly volatile (e.g. a mean-reverting SDE).
Motivation: energy storage
- The state of the problem at time $n$ is $S^n = (R^n, P^n)$, where
  - $R^n$ is the resource level, the amount of energy currently stored
  - $P^n$ is the current spot price of energy
- We make a decision $x^n$ representing how much to charge ($x^n \geq 0$) or discharge ($x^n < 0$).
- The single-period revenue is $C(S^n, x^n) = -P^n x^n$.
- We are maximizing total discounted revenue over a long time horizon.
Approximate dynamic programming
- We cannot solve Bellman's equation
  $V(S) = \max_x \left( C(S,x) + \gamma \mathbb{E}\left[ V(S') \mid S \right] \right)$
  for all states, because the price process is continuous.
- Instead, we compute an approximate observation
  $\hat{v}^n = \max_x \left( C(S^n,x) + \gamma V^{n-1}\!\left( S^{M,x}(S^n,x) \right) \right)$
  where $S^{M,x}$ maps $S^n$ and $x$ to the post-decision state $S^{x,n}$ (Van Roy et al. 1997).
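As an illustration (not from the original slides), a minimal Python sketch of how this observation could be computed over a discretized action set; `contribution`, `post_decision`, and `vfa` are hypothetical placeholder names for the problem-specific pieces.

```python
import numpy as np

def adp_observation(S, actions, contribution, post_decision, vfa, gamma=0.99):
    """Compute v_hat = max_x [ C(S, x) + gamma * V^{n-1}(S^{M,x}(S, x)) ].

    `contribution`, `post_decision`, and `vfa` are problem-specific callables
    supplied by the user (hypothetical names, not defined in the slides).
    """
    values = [contribution(S, x) + gamma * vfa(post_decision(S, x)) for x in actions]
    best = int(np.argmax(values))
    return values[best], actions[best]
```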
The post-decision state
- [Diagram: the decision $x^n$ takes $S^n$ to the post-decision state $S^{x,n}$; the exogenous transition $P(S^{n+1} \mid S^{x,n})$ then takes $S^{x,n}$ to $S^{n+1}$, and so on to $S^{x,n+1}$.]
- In our simple storage problem, the post-decision state is given by
  $R^{x,n} = R^n + x^n$, $\quad P^{x,n} = P^n$.
- The transition from $P^{x,n}$ to $P^{n+1}$ comes from the price process.
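A small sketch (mine, not from the slides) of the storage problem's transition structure; the revenue sign convention follows the earlier slide, and the exogenous price draw is left abstract.

```python
def post_decision_state(S, x):
    """Post-decision state: R^{x,n} = R^n + x^n, P^{x,n} = P^n."""
    R, P = S
    return (R + x, P)

def next_pre_decision_state(S_post, next_price):
    """Pre-decision state S^{n+1}: the resource level carries over from the
    post-decision state, while the price moves according to the price process."""
    R_post, _ = S_post
    return (R_post, next_price)

def revenue(S, x):
    """Single-period revenue C(S, x) = -P * x (pay when charging, earn when discharging)."""
    _, P = S
    return -P * x
```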
The exploration/exploitation dilemma
- Exploitation: the decision
  $x^n = \arg\max_x \left( C(S^n,x) + \gamma V^{n-1}(S^{x,n}) \right)$
  seems to be the best under the current VFA.
- Exploration: because the VFA is inaccurate, we might make a different decision, in the hope of discovering a better strategy.
- [Diagram: two-state example with rewards 1, 0, and +20.]
- Example: if we start in state 0 with $V^{n-1} = 0$ for both states, a pure exploitation strategy leads us to stay in state 0 forever.
Basis functions
- We use a parametric approximation
  $V(S^x) = \sum_{i=1}^F \theta_i \phi_i(S^x)$
  where the $\phi_i$ are basis functions of the post-decision state:
  $\phi(S^x) = \left( 1,\, R^x,\, (R^x)^2,\, P^x,\, (P^x)^2,\, R^x P^x \right)^T$.
- The problem is reduced; we only need to fit a finite set of parameters.
- A single observation now changes our beliefs about the entire value surface.
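A minimal sketch (not from the slides) of the feature vector and the resulting VFA, assuming the post-decision state is a (resource, price) pair as above.

```python
import numpy as np

def basis(S_post):
    """phi(S^x) = (1, R^x, (R^x)^2, P^x, (P^x)^2, R^x * P^x)^T."""
    R, P = S_post
    return np.array([1.0, R, R**2, P, P**2, R * P])

def vfa(S_post, theta):
    """Parametric approximation V(S^x) = theta^T phi(S^x)."""
    return float(basis(S_post) @ theta)
```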
Correlated beliefs using multivariate normal priors
- Correlations are modeled using a covariance matrix: we assume that $\theta \sim N(\theta^0, C^0)$.
- This induces a prior on the value function:
  $\mathbb{E}\, V(S^x) = (\theta^0)^T \phi(S^x)$
  $\mathrm{Cov}\left( V(S^x), V(S^{x'}) \right) = \phi(S^x)^T C^0 \phi(S^{x'})$
- The quantity $V^0(S^x) = \mathbb{E}\, V(S^x)$ represents our initial estimate of the value of $S^x$.
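For illustration only (reusing the hypothetical `basis` helper sketched earlier): how the normal prior on $\theta$ induces a prior mean and covariance for the values of any pair of post-decision states.

```python
import numpy as np

def induced_prior(S1, S2, theta0, C0):
    """Prior mean E V(S^x) = (theta^0)^T phi(S^x) and prior covariance
    Cov(V(S^x), V(S^x')) = phi(S^x)^T C^0 phi(S^x')."""
    phi1, phi2 = basis(S1), basis(S2)
    mean1 = float(theta0 @ phi1)
    cov12 = float(phi1 @ C0 @ phi2)
    return mean1, cov12
```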
Main assumption
- Assumption 1 (Dearden et al. 1998): the ADP observation $\hat{v}^{n+1}$ follows the distribution $N\left( V(S^{x,n}), \sigma_\varepsilon^2 \right)$ and is independent of past observations.
- Standard assumption in optimal learning (e.g. bandit problems), but does not hold in ADP.
- The observation $\hat{v}^{n+1}$ is biased by definition:
  $\hat{v}^{n+1} = \max_x \left( C(S^{n+1},x) + \gamma V^n\!\left( S^{M,x}(S^{n+1},x) \right) \right)$
Learning with correlated beliefs
- Once we choose action $x$, Assumption 1 allows us to update
  $\theta^{n+1} = \theta^n + \dfrac{\hat{v}^{n+1} - \phi(S^{x,n})^T \theta^n}{\sigma_\varepsilon^2 + \phi(S^{x,n})^T C^n \phi(S^{x,n})}\, C^n \phi(S^{x,n})$
  $C^{n+1} = C^n - \dfrac{C^n \phi(S^{x,n}) \phi(S^{x,n})^T C^n}{\sigma_\varepsilon^2 + \phi(S^{x,n})^T C^n \phi(S^{x,n})}$
  resulting in a new induced prior on the value function:
  $V^{n+1}(S^x) = \phi(S^x)^T \theta^{n+1}$
- Even if $C^0$ is a diagonal matrix, the updating equations will create empirical covariances.
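A minimal sketch of the update above (my code, not the authors'); under Assumption 1 it is the standard recursive Bayesian linear-regression update.

```python
import numpy as np

def bayes_update(theta, C, phi, v_hat, sigma_eps2):
    """Update (theta^n, C^n) -> (theta^{n+1}, C^{n+1}) after observing v_hat
    at a post-decision state with feature vector phi = phi(S^{x,n})."""
    Cphi = C @ phi
    denom = sigma_eps2 + phi @ Cphi          # sigma_eps^2 + phi^T C^n phi
    theta_new = theta + ((v_hat - phi @ theta) / denom) * Cphi
    C_new = C - np.outer(Cphi, Cphi) / denom
    return theta_new, C_new
```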
The knowledge gradient policy
- Previously studied in ranking and selection (Gupta & Miescke 1996) and multi-armed bandits (Ryzhov et al. 2010).
- One-period look-ahead policy: how much will the next decision improve our expected objective value?
- KG decision rule:
  $x^{KG,n} = \arg\max_x \left( C(S^n,x) + \gamma\, \mathbb{E}^n_x V^{n+1}(S^{x,n}) \right)$
- We look one step ahead: on average, what will the VFA become after the next decision?
Intuition behind the KG idea
- [Animation frames: starting from $S^n$ with contribution $C(S^n,x)$, an observation $\hat{v}^{n+1}$ arrives and the belief changes from $\theta^n$ to $\theta^{n+1}$.]
- Can we account for the change from $\theta^n$ to $\theta^{n+1}$ before we choose to go to $S^{x,n}$?
Derivation
- KG decision rule:
  $x^{KG,n} = \arg\max_x \left( C(S^n,x) + \gamma\, \mathbb{E}^n_x V^{n+1}(S^{x,n}) \right)$
- We can expand
  $\mathbb{E}^n_x V^{n+1}(S^{x,n}) = \sum_{S^{n+1}} P\left( S^{n+1} \mid S^{x,n} \right) \mathbb{E}^n_x \max_{x'} Q^{n+1}\left( S^{n+1}, x' \right)$
  where
  $Q^{n+1}\left( S^{n+1}, x' \right) = C\left( S^{n+1}, x' \right) + \gamma\, \phi\left( S^{x',n+1} \right)^T \theta^{n+1}$
Derivation
- Bayesian analysis (Frazier et al. 2009) shows that the conditional distribution of $\theta^{n+1}$, given decision $x$, is
  $\theta^{n+1} \sim \theta^n + \dfrac{C^n \phi(S^{x,n})}{\sqrt{\sigma_\varepsilon^2 + \phi(S^{x,n})^T C^n \phi(S^{x,n})}}\, Z$
  where $Z \sim N(0,1)$.
- This $Z$ is not indexed by $S$, yielding
  $\mathbb{E}^n_x \max_{x'} Q^{n+1}\left( S^{n+1}, x' \right) = \mathbb{E} \max_{x'} \left( a^n_{x'} + b^n_{x'} Z \right)$
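The slides evaluate $\mathbb{E} \max_{x'}(a^n_{x'} + b^n_{x'} Z)$ exactly using the procedure of Frazier et al. (2009); as a rough stand-in for illustration (mine, with arbitrary defaults), a Monte Carlo estimate of the same expectation:

```python
import numpy as np

def expected_max_affine(a, b, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E[ max_x (a_x + b_x * Z) ] with Z ~ N(0,1).

    The exact (non-simulation) computation of Frazier et al. (2009) is what the
    slides rely on; this sampling version is only an illustrative substitute.
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal(n_samples)
    return float(np.max(a[:, None] + b[:, None] * Z[None, :], axis=0).mean())
```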
Final form of the knowledge gradient policy
- This expectation can be computed, and the KG decision becomes
  $x^{KG,n} = \arg\max_x \left( C(S^n,x) + \gamma\, \phi(S^{x,n})^T \theta^n + \gamma \sum_{S^{n+1}} P\left( S^{n+1} \mid S^{x,n} \right) \nu^{KG,n}\left( S^{x,n}, S^{n+1} \right) \right)$
- This is Bellman's equation plus a KG term representing the value of information.
- The transition probabilities are difficult to compute, but we can simulate $K$ transitions from $S^{x,n}$ to $S^{n+1}$ and approximate
  $\sum_{S^{n+1}} P\left( S^{n+1} \mid S^{x,n} \right) \nu^{KG,n}\left( S^{x,n}, S^{n+1} \right) \approx \dfrac{1}{K} \sum_{k=1}^K \nu^{KG,n}\left( S^{x,n}, S^{n+1}_k \right)$
  where $S^{n+1}_k$ is the $k$th simulated transition.
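A sketch of this simulation-based approximation (hypothetical helper names: `simulate_transition` draws $S^{n+1}$ from $P(\cdot \mid S^{x,n})$, and `nu_kg` evaluates $\nu^{KG,n}$).

```python
import numpy as np

def kg_value_of_information(S_post, simulate_transition, nu_kg, K=20, seed=0):
    """Approximate sum_{S'} P(S' | S^{x,n}) * nu^{KG,n}(S^{x,n}, S') by averaging
    the KG term over K simulated transitions from the post-decision state."""
    rng = np.random.default_rng(seed)
    samples = [nu_kg(S_post, simulate_transition(S_post, rng)) for _ in range(K)]
    return float(np.mean(samples))
```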
Setup of experiments
- Two-dimensional state variable: $S^n = (R^n, P^n)$.
- The price variable evolves according to a mean-reverting SDE (see the sketch after this slide).
- The state is continuous, but we discretize the VFA.
- Two performance measures:
  - Online: how much revenue did we collect while training the VFA?
  - Offline: how much revenue can we collect after the VFA is fixed?
- Two forms of KG:
  - Online: maximize Bellman's equation plus the value of information.
  - Offline: maximize the value of information only.
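The slides only state that the price is mean-reverting; as an illustration, an Euler discretization of an Ornstein-Uhlenbeck process with made-up parameter values (not those used in the experiments).

```python
import numpy as np

def simulate_prices(P0=30.0, kappa=0.5, mu=30.0, sigma=5.0, dt=1.0, n_steps=150, seed=0):
    """Euler scheme for dP = kappa*(mu - P) dt + sigma dW.
    All parameter values here are illustrative placeholders."""
    rng = np.random.default_rng(seed)
    P = np.empty(n_steps + 1)
    P[0] = P0
    for t in range(n_steps):
        P[t + 1] = P[t] + kappa * (mu - P[t]) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    return P
```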
Numerical results

                         Offline objective        Online objective
                         Mean        Avg. SE      Mean        Avg. SE
  Lookup VFA
    Offline KG            210.43     0.33         -277.37     15.90
    Online KG              79.36     0.23          160.28      5.43
    BQL                   133.65     0.31           76.03      1.87
    Bayes EG               85.65     0.31           67.10      3.00
    ADP EG                154.47     0.35            7.09      2.57
  Basis VFA
    Offline KG           1136.20     3.54         -342.23     19.96
    Online KG             871.1285   3.15           44.58     27.71
    Basis EG             -475.54     2.30         -329.03     25.31

Table: Online and offline values (averaged over 1000 sample paths) achieved by different methods in 150 iterations.
Online performance of KG with basis functions
- Early observations drastically change the VFA, leading to volatile online results.
- However, as the VFA improves, results get better quickly.
Policy produced by offline KG
- After 150 iterations, the final VFA produces a "buy low, sell high" policy.
Implementation issues
- We can obtain a very good fit, but we need to pick the problem parameters ($\sigma_\varepsilon^2$ and $C^0$) carefully.
- Some parameter values can cause the estimates $\theta^n$ to diverge.
- This is a property of basis functions, not of KG.
- A comment on basis functions (Sutton et al. 2008): "Counterexamples have been known for many years ... in which Q-learning's parameters diverge to infinity for any positive step size ... the need [to handle large state spaces] is so great that practitioners have often simply ignored the problem and continued to use Q-learning with linear function approximation anyway."
Conclusions
- We have proposed a new policy for making decisions in ADP with parametric value function approximations.
- The policy uses a Bayesian belief structure to calculate the value of information obtained from a single decision.
- We use the knowledge gradient approach to balance exploration and exploitation in both online and offline settings.
- The policy produces very good fits, but requires tuning.
References

Dearden, R., Friedman, N. & Russell, S. (1998). Bayesian Q-learning. Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pp. 761-768.

Frazier, P.I., Powell, W. & Dayanik, S. (2009). The knowledge gradient policy for correlated normal rewards. INFORMS Journal on Computing 21(4), 599-613.

Gupta, S. & Miescke, K. (1996). Bayesian look ahead one stage sampling allocation for selecting the best population. Journal of Statistical Planning and Inference 54, 229-244.

Ryzhov, I.O., Powell, W. & Frazier, P.I. (2010). The knowledge gradient algorithm for a general class of online learning problems. Submitted to Operations Research.

Sutton, R., Szepesvári, C. & Maei, H. (2008). A convergent O(n) algorithm for off-policy temporal difference learning with linear function approximation. Advances in Neural Information Processing Systems 21, 1609-1616.

Van Roy, B., Bertsekas, D., Lee, Y. & Tsitsiklis, J. (1997). A neuro-dynamic programming approach to retailer inventory management. Proceedings of the 36th IEEE Conference on Decision and Control, pp. 4052-4057.