THE KNOWLEDGE GRADIENT ALGORITHM USING LOCALLY PARAMETRIC APPROXIMATIONS


Proceedings of the 2013 Winter Simulation Conference
R. Pasupathy, S.-H. Kim, A. Tolk, R. Hill, and M. E. Kuhl, eds.

THE KNOWLEDGE GRADIENT ALGORITHM USING LOCALLY PARAMETRIC APPROXIMATIONS

Bolong Cheng
Electrical Engineering
Princeton University
Princeton, NJ 08544, USA

Arta A. Jamshidi
Warren B. Powell
Operations Research and Financial Engineering
Princeton University
Princeton, NJ 08544, USA

ABSTRACT

We are interested in maximizing a general (but continuous) function where observations are noisy and may be expensive. We derive a knowledge gradient policy, which chooses measurements that maximize the expected value of information, while using a locally parametric belief model built from linear approximations around regions of the function, known as clouds. The method, called DC-RBF (Dirichlet Clouds with Radial Basis Functions), is well suited to recursive estimation and uses a compact representation of the function that avoids storing the entire history. Our technique allows for correlated beliefs within adjacent subsets of the alternatives and does not impose any a priori assumption on the global shape of the underlying function. Experimental work suggests that the method adapts to a range of arbitrary, continuous functions and appears to reliably find the optimal solution.

1 INTRODUCTION

We consider the problem of maximizing an unknown but continuous function over a finite set of possible alternatives (which may be sampled from a continuous region), where observations of the function are noisy and may be expensive to compute. This problem arises in settings such as simulation-optimization, stochastic search, and ranking and selection. A popular strategy involves response surface methods, which fit polynomial approximations to guide the search for the next measurement. Our work was motivated by difficulties we encountered fitting parametric surfaces, even to relatively simple functions. Low-order models can produce terrible approximations, while higher-order models quickly suffer from overfitting. This experience led us to consider a variety of statistical strategies, but ultimately produced a new local parametric procedure called Dirichlet Clouds with Radial Basis Functions (DC-RBF), developed by Jamshidi and Powell (2013). This paper addresses the problem of doing stochastic search using the knowledge gradient, where the underlying belief model is represented using DC-RBF.

The optimization of noisy functions, broadly referred to as stochastic search, has been studied thoroughly since the seminal paper of Robbins and Monro (1951), which introduced the idea of derivative-based stochastic gradient algorithms. For extensive coverage of the literature on stochastic search methods see, e.g., Spall (2003), Sutton and Barto (1998), and Fu (2006). A separate line of research has evolved under the umbrella of active (or optimal) learning, where observations are made specifically based on some measure of the value of information; see Powell and Ryzhov (2012) for a review. The idea of making measurements based on the marginal value of information was introduced by Gupta and Miescke (1996) and extended under the name knowledge gradient using a Bayesian approach, which estimates the value of measuring an alternative through the predictive distributions of the means, as shown in Frazier et al. (2008). The online learning setting with discrete alternatives is studied in Gittins (1979); the knowledge gradient is extended to online (bandit) settings in Ryzhov et al. (2012).
The case of continuous decisions has been studied in Agrawal (1995) and Ginebra and Clayton (1995). Most of the previous work on ranking and selection problems assumes the alternatives to be independent (alternatives close to each other do not exhibit correlated beliefs); see Nelson et al. (2001). There is a small literature that deals with correlated beliefs (Frazier et al. 2009; Huang et al. 2006). Villemonteix et al. (2009) introduce entropy minimization-based methods for Gaussian processes. An adaptation of the knowledge gradient with correlated beliefs that uses kernel regression and aggregation of kernels to estimate the belief function is proposed in Barut and Powell (2013). That work builds on Mes et al. (2011), where the estimates are hierarchical aggregates of the values. The present paper is an extension of the knowledge gradient with linear beliefs, given in Negoescu et al. (2011), to non-parametric beliefs.

There are three major classes of function approximation methods: look-up tables, parametric models (linear or nonlinear), and nonparametric models. Parametric regression techniques such as linear regression (see Montgomery et al. (2001)) assume that the underlying structure of the data is known a priori and lies in the span of the regressor functions. Because of its simplicity, this approach is commonly used for regression. Nonparametric models, such as Eubank (1988) and Müller (1988), offer the attraction of considerable generality by using the raw data to build local approximations of the function, producing a flexible but data-intensive representation. Nonparametric models are less sensitive to the structural errors that can arise from a parametric model, but most of them require keeping track of all observed data points, which makes function evaluations increasingly expensive as the algorithm progresses, a serious problem in stochastic search. Bayesian techniques for function approximation or regression are computationally intensive and require storage of all the data points (see Gelman et al. (2003)). Local polynomial models, such as Fan (1992), build linear models around each observation and keep track of all the data points. Another class of approximation algorithms uses local approximations around regions of the function, rather than around each prior observation. Radial basis functions have attracted considerable attention due to their simplicity and generality. One of the main attractions of radial basis functions (RBFs) is that the resulting optimization problem can be broken efficiently into linear and nonlinear subproblems. Normalized RBFs are presented in Jones et al. (1990) and perform well for approximation with smaller amounts of training data. For a comprehensive treatment of various growing RBF techniques and automatic function approximation using RBFs, see Jamshidi and Kirby (2007) and the references therein.

DC-RBF is motivated by the need to approximate functions within stochastic search algorithms where new observations arrive recursively. As we obtain new information from each iteration, DC-RBF provides a fast and flexible method for updating the approximation. DC-RBF is more flexible than classical parametric models and provides a compact representation that minimizes computational overhead. Unlike similar algorithms in the literature, our method has only one tunable parameter, assuming that the input data has been properly scaled. The main contribution of this paper is the derivation of the knowledge gradient using DC-RBF to create the belief model.
We show the performance of the proposed idea in the context of several illustrative examples.

Section 2 formulates the ranking and selection model and establishes the notation used in this paper; it also highlights the Bayesian inference needed to build the approximation. Section 3 reviews the knowledge gradient using both a lookup table and a linear, parametric belief model. Section 4 reviews the DC-RBF approximation technique that is used for constructing the belief model in this paper. Section 5 derives the knowledge gradient using the DC-RBF belief model. Section 6 demonstrates the performance of the proposed methodology using examples drawn from different problem classes. Finally, Section 7 provides concluding remarks and highlights our future work.

2 THE RANKING AND SELECTION MODEL

We consider a finite set of alternatives X = {1, 2, ..., M}, where our belief about the true value µ_x of each alternative x ∈ X is multivariate normal, µ ∼ N(θ^0, Σ^0), where (θ^0, Σ^0) represents the prior mean and covariance matrix. If we choose to observe x = x^n, we observe ŷ^{n+1}_x = µ_x + ε^{n+1}, where E_n[µ] = θ^n and where ε^{n+1} is a random variable with known variance λ_x or, equivalently, precision β^ε_x = (λ_x)^{-1}. We define µ to be the vector of all the means [µ_1, ..., µ_M]. Having a fixed budget of N measurements, we need to make sequential sampling decisions x^0, x^1, ..., x^{N-1} to learn about these alternatives. As a result of this sequential sampling framework, it is natural to define the filtration F^n as the σ-algebra generated by {(x^0, ŷ^1), (x^1, ŷ^2), ..., (x^{n-1}, ŷ^n)}. Let E_n and Var_n denote E[· | F^n] and Var[· | F^n] (the conditional expectation and variance with respect to F^n), respectively. In the off-line setting, the goal is to find the optimal alternative after N measurements, where the final decision is x^N = arg max_{x ∈ X} θ^N_x. Let Π be the set of all possible measurement policies, and E^π the expectation taken when policy π ∈ Π is used. The problem of finding the optimal policy can be written as

    \sup_{\pi \in \Pi} E^\pi \left[ \max_{x \in X} \theta^N_x \right].

We use the Bayesian setting to sequentially update the estimates of the alternatives. If at time n we choose x^n = x, we observe ŷ^{n+1}_x. Combining this action with the corresponding observation and our prior belief about µ, we can compute the (n+1)st posterior distribution using standard normal sampling with a multivariate normal prior, using the following updating equations (Gelman et al. 2003):

    \theta^{n+1} = \theta^n + \frac{\hat{y}^{n+1}_x - \theta^n_x}{\lambda_x + \Sigma^n_{xx}} \, \Sigma^n e_x,    (1)

    \Sigma^{n+1} = \Sigma^n - \frac{\Sigma^n e_x e_x^T \Sigma^n}{\lambda_x + \Sigma^n_{xx}},    (2)

where e_x is the standard basis vector. We can further rearrange Equation (1) as the time-n conditional distribution of θ^{n+1},

    \theta^{n+1} = \theta^n + \sigma(\Sigma^n, x^n) Z,    where    \sigma(\Sigma^n, x^n) = \frac{\Sigma^n e_x}{\sqrt{\lambda_x + \Sigma^n_{xx}}}

and Z is a standard normal random variable.

3 KNOWLEDGE GRADIENT WITH CORRELATED BELIEFS

In this work we focus on the knowledge gradient with correlated beliefs, a sequential decision policy for learning about correlated alternatives (Frazier et al. 2009). In this framework the state of knowledge at time n is defined as S^n := (θ^n, Σ^n). The goal is to pick the best possible option if we stop measuring at time n. The value of being in state S^n is defined as

    V^n(S^n) = \max_{x \in X} \theta^n_x.

If we choose to measure x^n = x, a new belief state S^{n+1}(x) is reached after observing ŷ^{n+1}_x, using the updating equations (1) and (2). The value of this new state is

    V^{n+1}(S^{n+1}(x)) = \max_{x \in X} \theta^{n+1}_x.

The incremental value due to measuring alternative x is defined as

    \nu^{KG,n}_x = E\left[ V^{n+1}(S^{n+1}(x)) - V^n(S^n) \mid S^n, x^n = x \right]
                 = E\left[ \max_{x' \in X} \theta^{n+1}_{x'} \mid S^n, x^n = x \right] - \max_{x' \in X} \theta^n_{x'}.
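To make the updating equations concrete, the following Python sketch shows one way Equations (1) and (2) might be implemented recursively; the function name and the use of NumPy are our own illustrative choices rather than code from the paper.

    import numpy as np

    def update_correlated_beliefs(theta, Sigma, x, y_hat, lam):
        """One Bayesian update of a multivariate normal belief (theta, Sigma)
        after measuring alternative x and observing y_hat with noise variance lam;
        a direct transcription of Equations (1) and (2)."""
        Sigma_col = Sigma[:, x]                    # Sigma^n e_x
        denom = lam + Sigma[x, x]                  # lambda_x + Sigma^n_xx
        theta_new = theta + ((y_hat - theta[x]) / denom) * Sigma_col   # Equation (1)
        Sigma_new = Sigma - np.outer(Sigma_col, Sigma_col) / denom     # Equation (2)
        return theta_new, Sigma_new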

We would like to maximize the expected value of V^{n+1}(S^{n+1}(x)) at time n. The knowledge gradient policy chooses the measurement that maximizes this expected incremental value, i.e.,

    x^{KG,n} = \arg\max_{x \in X} \nu^{KG,n}_x.

An algorithm to compute the KG values for alternatives with correlated beliefs is given in Frazier et al. (2009), which we briefly describe here. Consider an alternative way to write the KG formula,

    \nu^{KG,n}_x = h(\theta^n, \sigma(\Sigma^n, x))
                 = E\left[ \max_{x' \in X} \theta^n_{x'} + \sigma_{x'}(\Sigma^n, x^n) Z \mid S^n, x^n = x \right] - \max_{x' \in X} \theta^n_{x'},

where h(a, b) = E[\max_i a_i + b_i Z] - \max_i a_i is a generic function for any vectors a and b of the same dimension. The expectation can be computed as the expectation of the point-wise maximum of the affine functions a_i + b_i Z with an algorithm of complexity O(M^2 \log M). The algorithm first sorts the alternatives so that the b_i are in increasing order, and removes a pair (a_i, b_i) if there is some i' with b_{i'} = b_i and a_{i'} > a_i (i.e., it removes parallel slopes with lower intercepts). It then removes the dominated pairs (a_i, b_i) for which, for every z ∈ R, there exists some i' ≠ i with a_{i'} + b_{i'} z ≥ a_i + b_i z. After all of the dominated components are dropped, new vectors (\tilde{a}, \tilde{b}) of dimension \tilde{M} are created. At the end of this process, we are left with the concave envelope of a set of affine functions. The function h(a, b) can be computed via

    h(a, b) = \sum_{i=1}^{\tilde{M} - 1} (\tilde{b}_{i+1} - \tilde{b}_i) \, f\left( -\left| \frac{\tilde{a}_i - \tilde{a}_{i+1}}{\tilde{b}_{i+1} - \tilde{b}_i} \right| \right),    (3)

where f(z) = φ(z) + z Φ(z). Here, φ(z) and Φ(z) are the standard normal density and cumulative distribution functions, respectively (Frazier et al. 2009).

Now suppose we can represent the true µ as a linear combination of a set of parameters, i.e., µ = Xα, where the elements of α are the coefficients of the column vectors of a design matrix X. Instead of maintaining a belief on the alternatives, we can maintain a belief on the coefficients, which is generally of a much lower dimension. If we have a multivariate normal distribution on α ∼ N(ϑ, C), we can induce a normal distribution on µ, namely µ ∼ N(Xϑ, X C X^T) (Negoescu et al. 2011). Note that we use the scripted ϑ for the estimates of the coefficients to differentiate them from the estimates of the alternatives described in the previous section. This linear transformation applies to both prior and posterior distributions. If at time n alternative x is measured, let x^n be the row vector of X corresponding to alternative x. We can update ϑ^{n+1} and C^{n+1} recursively via

    \vartheta^{n+1} = \vartheta^n + \frac{\hat{\varepsilon}^{n+1}}{\gamma^n} C^n x^n,
    C^{n+1} = C^n - \frac{1}{\gamma^n} \left( C^n x^n (x^n)^T C^n \right).

This is useful if, for example, we are trying to find the best setting of a vector of parameters which might have 10 or 20 dimensions. These equations allow us to generate tens of thousands of potential measurement alternatives x. If we used a lookup table belief model, as in Frazier et al. (2009), this would involve creating and updating a matrix Σ^n with tens of thousands of rows and columns. Using a parametric belief model, we only need to maintain the covariance matrix C^n, whose size is determined by the dimensionality of the parameter vector ϑ. We never need to compute the full matrix X C X^T, although we will have to compute a row of this matrix.
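Because the envelope construction behind Equation (3) is easy to get wrong, a hedged Python sketch of the computation of h(a, b) may be helpful; it is our own illustrative implementation of the procedure described above (function name and tie-breaking details are ours), not code from Frazier et al. (2009).

    import numpy as np
    from scipy.stats import norm

    def kg_h(a, b):
        """Compute h(a, b) = E[max_i a_i + b_i Z] - max_i a_i for Z ~ N(0, 1)
        by building the upper envelope of the affine functions a_i + b_i z."""
        a = np.asarray(a, dtype=float)
        b = np.asarray(b, dtype=float)
        order = np.lexsort((a, b))                 # sort by slope b, ties broken by intercept a
        a, b = a[order], b[order]
        keep = np.ones(len(b), dtype=bool)         # among equal slopes keep only the largest intercept
        keep[:-1] = ~np.isclose(b[:-1], b[1:])
        a, b = a[keep], b[keep]
        A, B = [a[0]], [b[0]]                      # lines currently on the upper envelope
        for ai, bi in zip(a[1:], b[1:]):
            while len(A) > 1:
                z_new = (A[-1] - ai) / (bi - B[-1])           # where the new line overtakes the last kept line
                z_prev = (A[-2] - A[-1]) / (B[-1] - B[-2])    # where the last kept line became maximal
                if z_new <= z_prev:
                    A.pop(); B.pop()               # the last line never achieves the maximum; drop it
                else:
                    break
            A.append(ai); B.append(bi)
        A, B = np.array(A), np.array(B)
        if len(A) == 1:
            return 0.0                             # a single surviving line carries no value of information
        f = lambda z: norm.pdf(z) + z * norm.cdf(z)
        c = -np.abs((A[:-1] - A[1:]) / (B[1:] - B[:-1]))
        return float(np.sum((B[1:] - B[:-1]) * f(c)))         # Equation (3)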

4 DC-RBF AS BELIEF MODEL

The DC-RBF scheme introduces a cover over the input space to define local regions, as described in Jamshidi and Powell (2013). The scheme stores only a statistical representation of the data in local regions and locally approximates the data with a low-order polynomial such as a linear model. The local model parameters are quickly updated using recursive least squares. A nonlinear weighting system associated with each local model determines the contribution of that local model to the overall model output. The combined effect of the local approximations and their associated weights produces a nonlinear function approximation tool. On the data sets that we have studied, this technique is efficient, both in computational time and in memory requirements, compared to classical nonparametric regression methods. The method is robust in the presence of homoscedastic and heteroscedastic noise, and it automatically determines the model order for a given dataset (one of the most challenging tasks in nonlinear function approximation). Unlike similar algorithms in the literature, the DC-RBF method has only one tunable parameter; in addition, it is asymptotically unbiased and provides an upper bound on the bias for finite samples (Jamshidi and Powell 2013).

Instead of the linear relationship µ = Xα, we assume there is a nonlinear relationship between µ and α,

    \mu = \frac{\sum_{i=1}^{N_c} \psi_i(c_i, W_i) X \alpha_i}{\sum_{i=1}^{N_c} \psi_i(c_i, W_i)},    (4)

where ψ_i is a kernel function with center c_i and width W_i. In our model, we employ the Gaussian kernel function

    \psi(c, W) = \exp\left( -\| x - c \|^2_W \right),    where    \| x \|^2_W = x^T W x.

N_c is the total number of kernels, which is automatically determined based on the shape of the function.

At time n, if alternative x^n = x is chosen, we observe ŷ^{n+1}_x. According to the DC-RBF algorithm, the data point x^n must be assigned to a cloud; the assignment is determined by the distance between the input data point x^n and the centers of the current clouds, D_i = ||x^n - c_i||. Let I = arg min_i D_i. If D_I > D_T (where D_T is a given, known threshold distance), we create a new cloud U_{N_c + 1} (where N_c denotes the number of existing clouds); otherwise we update cloud U_I with the new data point (Jamshidi and Powell 2013). We observe that this logic is independent of the observation ŷ^{n+1}_x, which simplifies (but also limits) our algorithm. The important simplification is that the set of clouds at time n+1, given the clouds at time n, is a deterministic function of x^n, and this greatly simplifies our adaptation of the knowledge gradient.

To update a particular cloud U_I, we need to update its local regression model and its weight. Locally, the iteration counter for cloud I is k_I, where \sum_i k_i = n, and x^n is the attribute vector (note that this is a row vector from the data matrix X). The recursive update for the response follows from the linear regression update

    \vartheta^{k_I + 1} = \vartheta^{k_I} + \frac{\hat{\varepsilon}^{k_I + 1}}{\gamma^{k_I}} C^{k_I} x^n,
    C^{k_I + 1} = C^{k_I} - \frac{1}{\gamma^{k_I}} \left( C^{k_I} x^n (x^n)^T C^{k_I} \right),

where we define \hat{\varepsilon}^{n+1} = \hat{y}^{n+1} - (\vartheta^n)^T x^n and \gamma^n = \lambda_{x^n} + (x^n)^T C^n x^n.
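A minimal Python sketch of the cloud-assignment and local recursive least squares step described above is given below; the class layout, the local prior covariance, and the omission of the kernel center and width bookkeeping (updated in the equations that follow) are our own simplifications, not the authors' implementation.

    import numpy as np

    class Cloud:
        """Local linear model for one Dirichlet cloud."""
        def __init__(self, center, dim):
            self.center = np.asarray(center, dtype=float)   # kernel center c_i
            self.k = 0                                      # local iteration counter k_I
            self.theta = np.zeros(dim)                      # local coefficients vartheta
            self.C = 100.0 * np.eye(dim)                    # local coefficient covariance (our prior choice)

    def dcrbf_update(clouds, x, x_row, y_hat, lam, D_T):
        """Assign the point x to the nearest cloud (or open a new one if the nearest
        center is farther than D_T), then run one recursive least squares step of
        the local linear model on (x_row, y_hat)."""
        x = np.asarray(x, dtype=float)
        x_row = np.asarray(x_row, dtype=float)
        if clouds:
            dists = [np.linalg.norm(x - c.center) for c in clouds]
            i_star = int(np.argmin(dists))                  # the index I in the text
        if not clouds or dists[i_star] > D_T:
            clouds.append(Cloud(x, len(x_row)))             # create cloud U_{N_c + 1}
            i_star = len(clouds) - 1
        cl = clouds[i_star]
        gamma = lam + x_row @ cl.C @ x_row                  # gamma
        eps = y_hat - cl.theta @ x_row                      # prediction error
        cl.theta = cl.theta + (eps / gamma) * (cl.C @ x_row)
        cl.C = cl.C - np.outer(cl.C @ x_row, x_row @ cl.C) / gamma
        cl.k += 1
        return i_star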

The center of the weighting kernel is updated via

    c^{k_I + 1} = c^{k_I} + \frac{x^n - c^{k_I}}{k_I}.

The width of the weighting kernel is then updated via the Welford formula, given in Knuth (1997). Note that W is computed for each dimension separately:

    Q^{k_I} = Q^{k_I - 1} + \frac{x^n - Q^{k_I - 1}}{k_I},
    S^{k_I} = S^{k_I - 1} + (x^n - Q^{k_I - 1})(x^n - Q^{k_I}).

The k_I-th estimate of the variance is

    W^2 = \frac{1}{k_I - 1} S^{k_I}.

5 KNOWLEDGE GRADIENT WITH DC-RBF BELIEF MODEL

It is apparent that Equation (4) is a weighted sum of multiple linear estimates. From the DC-RBF updating equations above, each data point is used to update the local estimate of exactly one cloud, implying that the coefficient vectors α_i, i = 1, ..., N_c, are independent. Suppose we impose normal beliefs on each α_i such that α_i ∼ N(ϑ_i, C_i); then µ is a linear combination of independent normal random variables, so it is also normally distributed, with

    \mu \sim N\left( \sum_{i=1}^{N_c} T_i \vartheta_i, \; \sum_{i=1}^{N_c} T_i C_i T_i^T \right),    where    T_i = \frac{\psi_i X}{\sum_{i=1}^{N_c} \psi_i}.

Substituting this new expression for the belief state back into h(a, b) = E[\max_i a_i + b_i Z] - \max_i a_i, we take

    a = \sum_{i=1}^{N_c} T_i^n \vartheta_i^n,
    b = \frac{\left( \sum_{i=1}^{N_c} T_i^n C_i^n (T_i^n)^T \right) e_x}{\sqrt{\lambda_x + \left( \sum_{i=1}^{N_c} T_i^n C_i^n (T_i^n)^T \right)_{xx}}}.

We can then compute the KG factor via the correlated belief method described in Equation (3).

6 EMPIRICAL RESULTS

Here we demonstrate the performance of the algorithm on a variety of synthesized data sets. The data sets have distinct features in terms of the complexity of the response function as well as the noise. Note that the algorithm provides a functional representation of the underlying belief function.

6.1 Quadratic Functions

Here we demonstrate KG-DC-RBF maximizing a quadratic function on the interval [1, 10]. We discretize the interval with step 0.1, producing a domain with 91 alternatives. We set the sampling variance to λ = 1 and the distance threshold for DC-RBF to D_T = 2. We initialize with a constant prior, with θ^0 = max_x(µ_x) + 3λ and Σ^0 = 100 I, where I is the identity matrix. Figure 1 illustrates the estimate of the function produced by KG-DC-RBF after 50 iterations. We observe that the proposed algorithm performs well in finding the maximum of the given function. There are 5 components in the DC-RBF model.
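As a rough illustration of how such an experiment might be wired together, the following sketch reuses the hypothetical update_correlated_beliefs and kg_h functions from the earlier sketches; the quadratic truth, the random sampling, and the use of the lookup-table correlated belief of Section 3 (rather than the DC-RBF belief, whose mean and covariance would be assembled from the cloud quantities of Section 5) are all our own simplifications.

    import numpy as np

    X = np.arange(1.0, 10.05, 0.1)                   # 91 alternatives on [1, 10]
    mu = -(X - 5.5) ** 2 + 90.0                      # an arbitrary quadratic truth (our choice)
    lam = 1.0                                        # sampling variance

    theta = np.full(len(X), mu.max() + 3 * lam)      # constant prior mean theta^0 = max mu + 3*lambda
    Sigma = 100.0 * np.eye(len(X))                   # prior covariance Sigma^0 = 100 I

    for n in range(50):
        kg = np.array([kg_h(theta, Sigma[:, x] / np.sqrt(lam + Sigma[x, x]))
                       for x in range(len(X))])      # KG factor of every alternative
        x_n = int(np.argmax(kg))                     # measure the alternative with the largest KG
        y_hat = mu[x_n] + np.sqrt(lam) * np.random.randn()
        theta, Sigma = update_correlated_beliefs(theta, Sigma, x_n, y_hat, lam)

    print("estimated best alternative:", X[int(np.argmax(theta))])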

Figure 1: Estimate of a quadratic function with KG-DC-RBF, after 50 iterations. This estimate aims at finding the maximum of the underlying function from a set of noisy observations; it does not aim at approximating the original function as a whole.

6.2 Newsvendor Problems

In the newsvendor problem, a newsvendor is trying to determine how many units of an item x to stock. The stocking cost and selling price of the item are c and p, respectively. One can assume an arbitrary distribution on the demand D. The expected profit is given by

    f(x) = E[p \min(x, D)] - c x.

With known prices and an uncertain demand, we want the optimal stocking quantity that maximizes profit. This problem represents a nontrivial challenge for many online estimation methods. First, it is highly heteroscedastic, with much higher measurement noise for larger values of x (especially x ≥ E[D]). Second, if p - c is small relative to c, the function is highly nonsymmetric, which causes problems for simple regressors such as quadratic functions. Higher-order approximations suffer from overfitting and generally lose the essential concavity of the function.

Figure 2 shows an instance of this problem, where we use p = 4, c = 3.9, and D sampled from N(40, 1). The small variance of the demand distribution causes the function to have a sharp drop, whereas a larger variance yields a smoother decline. In this particular example, we set the measurement noise to λ = 1, along with the prior beliefs θ^0 = max_x(µ_x) + 3λ and Σ^0 = 100 I. In this example we see that KG-DC-RBF quickly converges to the optimal alternative. This is otherwise very difficult with a parametric belief model such as a quadratic function.

Figure 2: Estimate of the optimal order quantity for a newsvendor problem using KG-DC-RBF.
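Simulating a noisy newsvendor observation is straightforward; the sketch below (our own illustrative code, using the prices and demand distribution of the example above) shows how the observations fed to the learning policy might be generated.

    import numpy as np

    def newsvendor_profit(x, p=4.0, c=3.9, demand_mean=40.0, demand_std=1.0, rng=np.random):
        """One noisy observation of the newsvendor profit p*min(x, D) - c*x,
        with normally distributed demand D."""
        D = rng.normal(demand_mean, demand_std)
        return p * min(x, D) - c * x

    # the objective f(x) = E[p*min(x, D)] - c*x can be estimated by averaging observations
    print(np.mean([newsvendor_profit(40.0) for _ in range(100000)]))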

6.3 Performance on One-dimensional Test Functions

In this section we compare the performance of KG-DC-RBF with two other methods of the same class. The first is pure exploration (EXP), where each alternative is selected with probability 1/M at every time step. The second is the knowledge gradient with non-parametric beliefs (KG-NP) presented in Barut and Powell (2013), which uses aggregates over a set of kernel functions to estimate the truth. We tested these policies on two types of Gaussian processes. All GP functions are defined on the mesh points x = 1, 2, ..., 100. For all the policies, we select the prior mean to be θ^0 = max_i µ_i + 3λ and the prior covariance to be Σ^0 = 100 I. For KG-NP, we use the Epanechnikov kernel.

6.3.1 Gaussian Process with Homoscedastic Covariance Functions

In this part we test the KG-DC-RBF policy on Gaussian processes with the covariance function

    \mathrm{Cov}(i, j) = \sigma^2 \exp\left( -\frac{(i - j)^2}{((M - 1)\rho)^2} \right),

which produces a stationary process with variance σ^2 and length scale ρ. Higher values of ρ produce fewer peaks, resulting in a smoother function, while the variance σ^2 scales the function vertically. Figure 3 illustrates randomly generated Gaussian processes with different values of ρ.

Figure 3: Stationary Gaussian processes with different values of ρ.

For our test functions, we use σ^2 = 0.5, choose ρ ∈ {0.05, 0.1}, and set D_T = 5. For both values of ρ, we generate 500 test functions on which to evaluate each policy. We use the opportunity cost as the performance benchmark,

    OC(n) = \max_i \mu_i - \mu_{i^*},    with    i^* = \arg\max_x \theta^n_x.

This measures the difference between the maximum of the truth and the true value of the best alternative found by a given algorithm. We average the opportunity cost over the 500 test functions for each value of ρ.
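A hedged sketch of how a stationary test function and the opportunity cost might be generated is shown below; the sampling mechanics and the jitter term are our own guesses at the construction described above.

    import numpy as np

    def sample_stationary_gp(M=100, sigma2=0.5, rho=0.1, rng=np.random):
        """Draw one test function mu on the mesh i = 1, ..., M from a GP with
        Cov(i, j) = sigma2 * exp(-(i - j)^2 / ((M - 1) * rho)^2)."""
        idx = np.arange(M)
        cov = sigma2 * np.exp(-np.subtract.outer(idx, idx) ** 2 / ((M - 1) * rho) ** 2)
        cov += 1e-10 * np.eye(M)                      # small jitter for numerical stability
        return rng.multivariate_normal(np.zeros(M), cov)

    def opportunity_cost(mu, theta):
        """OC(n) = max_i mu_i - mu_{i*}, where i* = argmax_x theta^n_x."""
        return np.max(mu) - mu[int(np.argmax(theta))]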

We also test the policies at two different sampling noise levels: a low-noise setting with λ = 0.01 and a high-noise setting with λ = (1/4)(max(µ) - min(µ)). The opportunity costs on a log scale for the different policies are given in Figure 4.

Figure 4: Comparison of policies on homoscedastic GPs, where d = max(µ) - min(µ). Panels: (a) ρ = 0.1, λ = 0.01; (b) ρ = 0.1, λ = d/4; (c) ρ = 0.05, λ = 0.01; (d) ρ = 0.05, λ = d/4.

The KG-DC-RBF method outperforms both KG-NP and EXP in the low-noise setting (λ = 0.01), because it can construct the underlying belief model from only a few observations. In the high-noise setting with ρ = 0.05, the functions have more peaks and valleys, and the advantage of KG-DC-RBF becomes less apparent with the fixed D_T = 5, although it retains slightly better performance. The high sampling variance makes some of the smaller peaks and valleys indistinguishable and requires sampling more alternatives, making it difficult to identify local maxima.

6.3.2 Gaussian Process with Heteroscedastic Covariance Functions

In this part we consider non-stationary covariance functions, in particular the Gibbs covariance function (Gibbs 1997). It has a structure similar to the exponential covariance function described above, but it is heteroscedastic. The Gibbs covariance function is given by

    \mathrm{Cov}(i, j) = \sigma^2 \sqrt{\frac{2\, l(i)\, l(j)}{l(i)^2 + l(j)^2}} \exp\left( -\frac{(i - j)^2}{l(i)^2 + l(j)^2} \right),

where l(i) is an arbitrary positive function of i. Here we choose a horizontally shifted periodic sine function given by

    l(i) = 10\left( 1 + \sin\left( 2\pi \left( \frac{i}{M} + u \right) \right) \right) + 1,

where u is a random number drawn from [0, 1].
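For completeness, a sketch of this Gibbs covariance with the sinusoidal length-scale function is given below; the i/M normalization inside the sine follows the formula as reconstructed above and is an assumption on our part.

    import numpy as np

    def gibbs_covariance(M=100, sigma2=0.5, rng=np.random):
        """Heteroscedastic Gibbs covariance on the mesh i = 1, ..., M with a
        randomly shifted sinusoidal length-scale function l(i)."""
        i = np.arange(1, M + 1, dtype=float)
        u = rng.uniform(0.0, 1.0)                                     # random horizontal shift
        l = 10.0 * (1.0 + np.sin(2.0 * np.pi * (i / M + u))) + 1.0    # assumed i/M normalization
        li, lj = np.meshgrid(l, l, indexing="ij")                     # l(i) and l(j) on the grid
        prefactor = np.sqrt(2.0 * li * lj / (li ** 2 + lj ** 2))
        sq_dist = np.subtract.outer(i, i) ** 2
        return sigma2 * prefactor * np.exp(-sq_dist / (li ** 2 + lj ** 2))

A heteroscedastic test function can then be drawn exactly as in the stationary sketch, with this covariance matrix in place of the stationary one.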

The opportunity costs on a log scale for the different policies are given in Figure 5.

Figure 5: Comparison of policies on heteroscedastic GPs, where d = max(µ) - min(µ). Panels: (a) λ = 0.01; (b) λ = d/4.

As in the stationary case, we observe that KG-DC-RBF outperforms both KG-NP and EXP in the low-noise setting. Note that in the heteroscedastic case the covariance between alternatives changes with the index of the alternative, which makes the estimation problem harder. As in the previous experiment, we observe that in the high-noise case the advantage of KG-DC-RBF becomes less apparent with the fixed D_T = 2, although it again retains slightly better performance. These experiments show that the KG-DC-RBF method is robust in the heteroscedastic case. In addition, KG-DC-RBF is faster and requires less storage.

7 CONCLUSION

In this paper we introduced an adaptation of the knowledge gradient technique for optimizing functions using the DC-RBF belief model. The knowledge gradient offers an efficient search strategy by maximizing the marginal value of a measurement. We note that the method focuses on finding the optimal solution rather than producing a good approximation of the function (even near the optimum).

The method offers some useful features. It overcomes the need for the extensive tuning required by parametric approximations. The local parametric approach automatically adapts to the order of the function. The belief model naturally accommodates heteroscedastic noise. We avoid the need to store the entire history of observations, although we do store history in the form of a smaller set of clouds. We also avoid the need to specify hyperparameters for Bayesian priors. The updating of the belief model is fully recursive, and the knowledge gradient is easy to compute by applying the knowledge gradient for correlated beliefs to the linear belief model of each cloud.

The proposed technique can be used for both on-line and off-line learning problems by using the simple bridging formula given in Ryzhov et al. (2012). The experimental work shown here is encouraging. Not surprisingly, we obtain better performance on functions with relatively low noise in the measurement process. Our experiments were all performed on relatively low-dimensional functions, and it remains to be seen how the method scales to higher-dimensional problems. In addition, we have found that fitting linear models to clouds with a small number of data points can be highly unreliable, especially in the presence of high levels of measurement noise. For this reason, we are investigating the use of a hierarchical blend of constant and linear functions with weights that adapt to the quality of the fit.

REFERENCES

Agrawal, R. 1995. The Continuum-Armed Bandit Problem. SIAM Journal on Control and Optimization 33.
Barut, E., and W. B. Powell. 2013. Optimal Learning for Sequential Sampling with Non-Parametric Beliefs. Journal of Global Optimization (to appear).
Eubank, R. L. 1988. Spline Smoothing and Nonparametric Regression. New York: Marcel Dekker.
Fan, J. 1992. Design-Adaptive Nonparametric Regression. Journal of the American Statistical Association 87 (420).
Frazier, P. I., W. B. Powell, and S. Dayanik. 2009. The Knowledge-Gradient Policy for Correlated Normal Beliefs. INFORMS Journal on Computing 21 (4).
Frazier, P. I., W. B. Powell, and S. Dayanik. 2008. A Knowledge-Gradient Policy for Sequential Information Collection. SIAM Journal on Control and Optimization 47 (5).
Fu, M. C. 2006. Gradient Estimation. In Handbooks in Operations Research and Management Science: Simulation, Chapter 19. Elsevier.
Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin. 2003. Bayesian Data Analysis. 2nd ed. Chapman and Hall/CRC.
Gibbs, M. 1997. Bayesian Gaussian Processes for Regression and Classification. Ph.D. thesis, University of Cambridge.
Ginebra, J., and M. K. Clayton. 1995. Response Surface Bandits. Journal of the Royal Statistical Society, Series B (Methodological) 57.
Gittins, J. C. 1979. Bandit Processes and Dynamic Allocation Indices. Journal of the Royal Statistical Society, Series B (Methodological) 41.
Gupta, S. S., and K. J. Miescke. 1996. Bayesian Look Ahead One-Stage Sampling Allocations for Selection of the Best Population. Journal of Statistical Planning and Inference 54.
Huang, D., T. T. Allen, W. I. Notz, and N. Zeng. 2006. Global Optimization of Stochastic Black-Box Systems via Sequential Kriging Meta-Models. Journal of Global Optimization 34.
Jamshidi, A. A., and M. J. Kirby. 2007. Towards a Black Box Algorithm for Nonlinear Function Approximation over High-Dimensional Domains. SIAM Journal on Scientific Computing 29 (3).
Jamshidi, A. A., and W. B. Powell. 2013. A Recursive Local Polynomial Approximation Method Using Dirichlet Clouds and Radial Basis Functions. SIAM Journal on Scientific Computing (submitted).
Jones, R. D., Y. Lee, C. W. Barnes, G. W. Flake, K. Lee, P. S. Lewis, and S. Qian. 1990. Function Approximation and Time Series Prediction with Neural Networks. In IJCNN International Joint Conference on Neural Networks, Volume 1.
Knuth, D. E. 1997. The Art of Computer Programming, Volume 2. 3rd ed. Addison-Wesley Professional.
Mes, M. R., W. B. Powell, and P. I. Frazier. 2011. Hierarchical Knowledge Gradient for Sequential Sampling. Journal of Machine Learning Research 12.
Montgomery, D. C., E. A. Peck, and G. G. Vining. 2001. Introduction to Linear Regression Analysis. 3rd ed. Wiley-Interscience.

Müller, H. G. 1988. Weighted Local Regression and Kernel Methods for Nonparametric Curve Fitting. Journal of the American Statistical Association 82.
Negoescu, D. M., P. I. Frazier, and W. B. Powell. 2011. The Knowledge-Gradient Algorithm for Sequencing Experiments in Drug Discovery. INFORMS Journal on Computing 23 (3).
Nelson, B. L., J. Swann, D. Goldsman, and W. Song. 2001. Simple Procedures for Selecting the Best Simulated System when the Number of Alternatives is Large. Operations Research 49.
Powell, W. B., and I. O. Ryzhov. 2012. Optimal Learning. John Wiley & Sons.
Ryzhov, I. O., W. B. Powell, and P. I. Frazier. 2012. The Knowledge Gradient for a General Class of Online Learning Problems. Operations Research 60.
Robbins, H., and S. Monro. 1951. A Stochastic Approximation Method. The Annals of Mathematical Statistics 22.
Spall, J. C. 2003. Introduction to Stochastic Search and Optimization. John Wiley & Sons.
Sutton, R. S., and A. G. Barto. 1998. Introduction to Reinforcement Learning. MIT Press.
Villemonteix, J., E. Vazquez, and E. Walter. 2009. An Informational Approach to the Global Optimization of Expensive-to-Evaluate Functions. Journal of Global Optimization 44.

AUTHOR BIOGRAPHIES

BOLONG (HARVEY) CHENG is a Ph.D. student in Electrical Engineering at Princeton University. He holds a B.Sc. degree in Electrical Engineering and does research in stochastic optimization in energy. His email address is bcheng@princeton.edu.

ARTA A. JAMSHIDI is a Research Associate in the Department of Operations Research and Financial Engineering at Princeton University. He is an expert in function approximation. He earned his Ph.D. in Mathematics in 2008, developing novel nonlinear signal processing tools that serve various applications. His research interests include machine learning, signal and image processing, optimization, and mathematical modeling. His email address is arta@princeton.edu.

WARREN B. POWELL is a Professor in the Department of Operations Research and Financial Engineering at Princeton University, and director of CASTLE Labs, which focuses on computational stochastic optimization and learning. His email address is powell@princeton.edu.


More information

Proceedings of the 2016 Winter Simulation Conference T. M. K. Roeder, P. I. Frazier, R. Szechtman, E. Zhou, T. Huschka, and S. E. Chick, eds.

Proceedings of the 2016 Winter Simulation Conference T. M. K. Roeder, P. I. Frazier, R. Szechtman, E. Zhou, T. Huschka, and S. E. Chick, eds. Proceedings of the 2016 Winter Simulation Conference T. M. K. Roeder, P. I. Frazier, R. Szechtman, E. Zhou, T. Huschka, and S. E. Chick, eds. SEQUENTIAL SAMPLING FOR BAYESIAN ROBUST RANKING AND SELECTION

More information

Introduction to Gaussian Processes

Introduction to Gaussian Processes Introduction to Gaussian Processes Iain Murray murray@cs.toronto.edu CSC255, Introduction to Machine Learning, Fall 28 Dept. Computer Science, University of Toronto The problem Learn scalar function of

More information

Mustafa H. Tongarlak Bruce E. Ankenman Barry L. Nelson

Mustafa H. Tongarlak Bruce E. Ankenman Barry L. Nelson Proceedings of the 0 Winter Simulation Conference S. Jain, R. R. Creasey, J. Himmelspach, K. P. White, and M. Fu, eds. RELATIVE ERROR STOCHASTIC KRIGING Mustafa H. Tongarlak Bruce E. Ankenman Barry L.

More information

Gaussian Process Regression: Active Data Selection and Test Point. Rejection. Sambu Seo Marko Wallat Thore Graepel Klaus Obermayer

Gaussian Process Regression: Active Data Selection and Test Point. Rejection. Sambu Seo Marko Wallat Thore Graepel Klaus Obermayer Gaussian Process Regression: Active Data Selection and Test Point Rejection Sambu Seo Marko Wallat Thore Graepel Klaus Obermayer Department of Computer Science, Technical University of Berlin Franklinstr.8,

More information

Lecture 5: GPs and Streaming regression

Lecture 5: GPs and Streaming regression Lecture 5: GPs and Streaming regression Gaussian Processes Information gain Confidence intervals COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 1 Recall: Non-parametric regression Input space X

More information

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc.

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010

More information

ICML Scalable Bayesian Inference on Point processes. with Gaussian Processes. Yves-Laurent Kom Samo & Stephen Roberts

ICML Scalable Bayesian Inference on Point processes. with Gaussian Processes. Yves-Laurent Kom Samo & Stephen Roberts ICML 2015 Scalable Nonparametric Bayesian Inference on Point Processes with Gaussian Processes Machine Learning Research Group and Oxford-Man Institute University of Oxford July 8, 2015 Point Processes

More information

TDT4173 Machine Learning

TDT4173 Machine Learning TDT4173 Machine Learning Lecture 3 Bagging & Boosting + SVMs Norwegian University of Science and Technology Helge Langseth IT-VEST 310 helgel@idi.ntnu.no 1 TDT4173 Machine Learning Outline 1 Ensemble-methods

More information

DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION. Alexandre Iline, Harri Valpola and Erkki Oja

DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION. Alexandre Iline, Harri Valpola and Erkki Oja DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION Alexandre Iline, Harri Valpola and Erkki Oja Laboratory of Computer and Information Science Helsinki University of Technology P.O.Box

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

The Expected Opportunity Cost and Selecting the Optimal Subset

The Expected Opportunity Cost and Selecting the Optimal Subset Applied Mathematical Sciences, Vol. 9, 2015, no. 131, 6507-6519 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2015.58561 The Expected Opportunity Cost and Selecting the Optimal Subset Mohammad

More information

Linear Models for Regression

Linear Models for Regression Linear Models for Regression Machine Learning Torsten Möller Möller/Mori 1 Reading Chapter 3 of Pattern Recognition and Machine Learning by Bishop Chapter 3+5+6+7 of The Elements of Statistical Learning

More information

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression

More information

PILCO: A Model-Based and Data-Efficient Approach to Policy Search

PILCO: A Model-Based and Data-Efficient Approach to Policy Search PILCO: A Model-Based and Data-Efficient Approach to Policy Search (M.P. Deisenroth and C.E. Rasmussen) CSC2541 November 4, 2016 PILCO Graphical Model PILCO Probabilistic Inference for Learning COntrol

More information

ECE521: Inference Algorithms and Machine Learning University of Toronto. Assignment 1: k-nn and Linear Regression

ECE521: Inference Algorithms and Machine Learning University of Toronto. Assignment 1: k-nn and Linear Regression ECE521: Inference Algorithms and Machine Learning University of Toronto Assignment 1: k-nn and Linear Regression TA: Use Piazza for Q&A Due date: Feb 7 midnight, 2017 Electronic submission to: ece521ta@gmailcom

More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information

Efficient Bayesian Multivariate Surface Regression

Efficient Bayesian Multivariate Surface Regression Efficient Bayesian Multivariate Surface Regression Feng Li (joint with Mattias Villani) Department of Statistics, Stockholm University October, 211 Outline of the talk 1 Flexible regression models 2 The

More information