THE KNOWLEDGE GRADIENT ALGORITHM USING LOCALLY PARAMETRIC APPROXIMATIONS


Proceedings of the 2013 Winter Simulation Conference
R. Pasupathy, S.-H. Kim, A. Tolk, R. Hill, and M. E. Kuhl, eds.

THE KNOWLEDGE GRADIENT ALGORITHM USING LOCALLY PARAMETRIC APPROXIMATIONS

Bolong Cheng
Electrical Engineering
Princeton University
Princeton, NJ 08544, USA

Arta A. Jamshidi
Warren B. Powell
Operations Research and Financial Engineering
Princeton University
Princeton, NJ 08544, USA

ABSTRACT

We are interested in maximizing a general (but continuous) function where observations are noisy and may be expensive. We derive a knowledge gradient policy, which chooses measurements that maximize the expected value of information, while using a locally parametric belief model built from linear approximations around regions of the function, known as clouds. The method, called DC-RBF (Dirichlet Clouds with Radial Basis Functions), is well suited to recursive estimation and uses a compact representation of the function that avoids storing the entire history. Our technique allows for correlated beliefs within adjacent subsets of the alternatives and does not impose any a priori assumption on the global shape of the underlying function. Experimental work suggests that the method adapts to a range of arbitrary, continuous functions and appears to reliably find the optimal solution.

1 INTRODUCTION

We consider the problem of maximizing an unknown but continuous function over a finite set of possible alternatives (which may be sampled from a continuous region), where observations of the function are noisy and may be expensive to compute. This problem arises in settings such as simulation-optimization, stochastic search, and ranking and selection. A popular strategy involves response surface methods, which fit polynomial approximations to guide the search for the next measurement. Our work was motivated by difficulties we encountered fitting parametric surfaces, even to relatively simple functions. Low-order models can produce terrible approximations, while higher-order models quickly suffer from overfitting. This experience led us to consider a variety of statistical strategies, but ultimately produced a new local parametric procedure called Dirichlet Clouds with Radial Basis Functions (DC-RBF), developed by Jamshidi and Powell (2013). This paper addresses the problem of doing stochastic search using the knowledge gradient, where the underlying belief model is represented using DC-RBF.

The optimization of noisy functions, broadly referred to as stochastic search, has been studied thoroughly since the seminal paper of Robbins and Monro (1951), which introduced the idea of derivative-based stochastic gradient algorithms. For extensive coverage of the literature on stochastic search methods see, e.g., Spall (2003), Sutton and Barto (1998), and Fu (2006). A separate line of research has evolved under the umbrella of active (or optimal) learning, where observations are made specifically based on some measure of the value of information; see Powell and Ryzhov (2012) for a review. The idea of making measurements based on the marginal value of information was introduced by Gupta and Miescke (1996) and extended under the name knowledge gradient using a Bayesian approach, which estimates the value of measuring an alternative through the predictive distributions of the means, as shown in Frazier et al. (2008). The online learning setting with discrete alternatives is studied in Gittins (1979); the knowledge gradient is extended to online (bandit) settings in Ryzhov et al. (2012).
The case of continuous decisions has been studied in Agrawal (1995) and Ginebra and Clayton (1995). Most of the previous work on ranking and selection problems assumes the alternatives to be independent (alternatives close to each other do not exhibit correlated beliefs); see Nelson et al. (2001). There is a small literature that deals with correlated beliefs (Frazier et al. 2009; Huang et al. 2006). Villemonteix et al. (2009) introduce entropy minimization-based methods for Gaussian processes. An adaptation of the knowledge gradient with correlated beliefs that uses kernel regression and aggregation of kernels to estimate the belief function is proposed in Barut and Powell (2013). That work builds on Mes et al. (2011), where the estimates are hierarchical aggregates of the values. The present paper is an extension of the knowledge gradient with linear beliefs, given in Negoescu et al. (2011), to non-parametric beliefs.

There are three major classes of function approximation methods: look-up tables, parametric models (linear or nonlinear), and nonparametric models. Parametric regression techniques such as linear regression (see Montgomery et al. (2001)) assume that the underlying structure of the data is known a priori and lies in the span of the regressor functions. Because of its simplicity, this approach is commonly used for regression. Nonparametric models, such as Eubank (1988) and Müller (1988), offer the attraction of considerable generality by using the raw data to build local approximations of the function, producing a flexible but data-intensive representation. Nonparametric models are less sensitive to the structural errors that can arise from a parametric model, but most of them require keeping track of all observed data points, which makes function evaluations increasingly expensive as the algorithm progresses, a serious problem in stochastic search. Bayesian techniques for function approximation or regression are computationally intensive and require storage of all the data points (see Gelman et al. (2003)). Local polynomial models, such as Fan (1992), build linear models around each observation and keep track of all the data points. Another class of approximation algorithms uses local approximations around regions of the function, rather than around each prior observation. Radial basis functions have attracted considerable attention due to their simplicity and generality. One of the main attractions of radial basis functions (RBFs) is that the resulting optimization problem can be broken efficiently into linear and nonlinear subproblems. Normalized RBFs are presented in Jones et al. (1990) and perform well for approximation with smaller amounts of training data. For a comprehensive treatment of various growing RBF techniques and automatic function approximation using RBFs, see Jamshidi and Kirby (2007) and the references therein.

DC-RBF is motivated by the need to approximate functions within stochastic search algorithms where new observations arrive recursively. As we obtain new information from each iteration, DC-RBF provides a fast and flexible method for updating the approximation. DC-RBF is more flexible than classical parametric models and provides a compact representation that minimizes computational overhead. Unlike similar algorithms in the literature, our method has only one tunable parameter, assuming that the input data has been properly scaled. The main contribution of this paper is the derivation of the knowledge gradient using DC-RBF to create the belief model.
We show the performance of the proposed idea in the context of several illustrative examples.

Section 2 formulates the ranking and selection model and establishes the notation used in this paper; it also highlights the Bayesian inference needed to build the approximation. Section 3 reviews the knowledge gradient using both a lookup table and a linear, parametric belief model. Section 4 reviews the DC-RBF approximation technique that is used for constructing the belief model in this paper. Section 5 derives the knowledge gradient using the DC-RBF belief model. Section 6 demonstrates the performance of the proposed methodology using examples drawn from different problem classes. Finally, Section 7 provides concluding remarks and highlights our future work.

2 THE RANKING AND SELECTION MODEL

We consider a finite set of alternatives X = {1, 2, ..., M}, where our belief about the true value µ_x of each alternative x ∈ X is multivariate normal, µ ∼ N(θ^0, Σ^0), where (θ^0, Σ^0) represents the prior mean and covariance matrix. If we choose to observe x = x^n, we observe ŷ^{n+1}_x = µ_x + ε^{n+1}, where E_n[µ] = θ^n and where ε^{n+1} is a random variable with known variance λ_x or, equivalently, precision β^ε_x = (λ_x)^{-1}. We define µ to be the vector of all the means [µ_1, ..., µ_M]. Having a fixed budget of N measurements, we need to make sequential sampling decisions x^0, x^1, ..., x^{N-1} to learn about these alternatives. As a result of this sequential sampling framework, it is natural to define the filtration F^n as the σ-algebra generated by {(x^0, ŷ^1), (x^1, ŷ^2), ..., (x^{n-1}, ŷ^n)}. Let E_n and Var_n denote E[· | F^n] and Var[· | F^n] (the conditional expectation and variance with respect to F^n), respectively. In the off-line setting, the goal is to find the optimal alternative after N measurements, where the final decision is x^N = arg max_{x ∈ X} θ^N_x. Let Π be the set of all possible measurement policies, and E^π the expectation taken when policy π ∈ Π is used. The problem of finding the optimal policy can be written as

    \sup_{\pi \in \Pi} E^\pi \left[ \max_{x \in X} \theta^N_x \right].

We use the Bayesian setting to sequentially update the estimates of the alternatives. If at time n we choose x^n = x, we observe ŷ^{n+1}_x. Combining this action with the corresponding observation and our prior belief about µ, we can compute the (n+1)st posterior distribution using standard normal sampling with a multivariate normal prior, using the following updating equations (Gelman et al. 2003):

    \theta^{n+1} = \theta^n + \frac{\hat{y}^{n+1}_x - \theta^n_x}{\lambda_x + \Sigma^n_{xx}} \, \Sigma^n e_x,    (1)

    \Sigma^{n+1} = \Sigma^n - \frac{\Sigma^n e_x e_x^T \Sigma^n}{\lambda_x + \Sigma^n_{xx}},    (2)

where e_x is the standard basis vector. We can further rearrange Equation (1) as the time-n conditional distribution of θ^{n+1},

    \theta^{n+1} = \theta^n + \sigma(\Sigma^n, x^n) Z,    where    \sigma(\Sigma^n, x^n) = \frac{\Sigma^n e_x}{\sqrt{\lambda_x + \Sigma^n_{xx}}}

and Z is a standard normal random variable.

3 KNOWLEDGE GRADIENT WITH CORRELATED BELIEFS

In this work we focus on the knowledge gradient with correlated beliefs, a sequential decision policy for learning about correlated alternatives (Frazier et al. 2009). In this framework the state of knowledge at time n is defined as S^n := (θ^n, Σ^n). The goal is to pick the best possible option if we stop measuring at time n. The value of being in state S^n is defined as

    V^n(S^n) = \max_{x \in X} \theta^n_x.

If we choose to measure x^n = x, a new belief state S^{n+1}(x) is reached after observing ŷ^{n+1}_x, using the updating equations (1) and (2). The value of this new state is

    V^{n+1}(S^{n+1}(x)) = \max_{x \in X} \theta^{n+1}_x.

The incremental value due to measuring alternative x is defined as

    \nu^{KG,n}_x = E\left[ V^{n+1}(S^{n+1}(x)) - V^n(S^n) \mid S^n, x^n = x \right]
                 = E\left[ \max_{x' \in X} \theta^{n+1}_{x'} \mid S^n, x^n = x \right] - \max_{x' \in X} \theta^n_{x'}.
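To make the updating equations concrete, the following Python sketch shows one way Equations (1) and (2) might be implemented recursively; the function name and the use of NumPy are our own illustrative choices rather than code from the paper.

    import numpy as np

    def update_correlated_beliefs(theta, Sigma, x, y_hat, lam):
        """One Bayesian update of a multivariate normal belief (theta, Sigma)
        after measuring alternative x and observing y_hat with noise variance lam;
        a direct transcription of Equations (1) and (2)."""
        Sigma_col = Sigma[:, x]                    # Sigma^n e_x
        denom = lam + Sigma[x, x]                  # lambda_x + Sigma^n_xx
        theta_new = theta + ((y_hat - theta[x]) / denom) * Sigma_col   # Equation (1)
        Sigma_new = Sigma - np.outer(Sigma_col, Sigma_col) / denom     # Equation (2)
        return theta_new, Sigma_new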

We would like to maximize the expected value of V^{n+1}(S^{n+1}(x)) at time n. The knowledge gradient policy chooses the measurement that maximizes this expected incremental value, i.e.,

    x^{KG,n} = \arg\max_{x \in X} \nu^{KG,n}_x.

An algorithm to compute the KG values for alternatives with correlated beliefs is given in Frazier et al. (2009), which we briefly describe here. Consider an alternative way to write the KG formula,

    \nu^{KG,n}_x = h(\theta^n, \sigma(\Sigma^n, x))
                 = E\left[ \max_{x' \in X} \theta^n_{x'} + \sigma_{x'}(\Sigma^n, x^n) Z \mid S^n, x^n = x \right] - \max_{x' \in X} \theta^n_{x'},

where h(a, b) = E[\max_i a_i + b_i Z] - \max_i a_i is a generic function for any vectors a and b of the same dimension. The expectation can be computed as the expectation of the point-wise maximum of the affine functions a_i + b_i Z with an algorithm of complexity O(M^2 \log M). The algorithm first sorts the alternatives so that the b_i are in increasing order, and removes a pair (a_i, b_i) if there is some i' with b_{i'} = b_i and a_{i'} > a_i (i.e., it removes parallel slopes with lower intercepts). It then removes the dominated pairs (a_i, b_i) for which, for every z ∈ R, there exists some i' ≠ i with a_{i'} + b_{i'} z ≥ a_i + b_i z. After all of the dominated components are dropped, new vectors (\tilde{a}, \tilde{b}) of dimension \tilde{M} are created. At the end of this process, we are left with the concave envelope of a set of affine functions. The function h(a, b) can be computed via

    h(a, b) = \sum_{i=1}^{\tilde{M} - 1} (\tilde{b}_{i+1} - \tilde{b}_i) \, f\left( -\left| \frac{\tilde{a}_i - \tilde{a}_{i+1}}{\tilde{b}_{i+1} - \tilde{b}_i} \right| \right),    (3)

where f(z) = φ(z) + z Φ(z). Here, φ(z) and Φ(z) are the standard normal density and cumulative distribution functions, respectively (Frazier et al. 2009).

Now suppose we can represent the true µ as a linear combination of a set of parameters, i.e., µ = Xα, where the elements of α are the coefficients of the column vectors of a design matrix X. Instead of maintaining a belief on the alternatives, we can maintain a belief on the coefficients, which is generally of a much lower dimension. If we have a multivariate normal distribution on α ∼ N(ϑ, C), we can induce a normal distribution on µ, namely µ ∼ N(Xϑ, X C X^T) (Negoescu et al. 2011). Note that we use the scripted ϑ for the estimates of the coefficients to differentiate them from the estimates of the alternatives described in the previous section. This linear transformation applies to both prior and posterior distributions. If at time n alternative x is measured, let x^n be the row vector of X corresponding to alternative x. We can update ϑ^{n+1} and C^{n+1} recursively via

    \vartheta^{n+1} = \vartheta^n + \frac{\hat{\varepsilon}^{n+1}}{\gamma^n} C^n x^n,
    C^{n+1} = C^n - \frac{1}{\gamma^n} \left( C^n x^n (x^n)^T C^n \right).

This is useful if, for example, we are trying to find the best setting of a vector of parameters which might have 10 or 20 dimensions. These equations allow us to generate tens of thousands of potential measurement alternatives x. If we used a lookup table belief model, as in Frazier et al. (2009), this would involve creating and updating a matrix Σ^n with tens of thousands of rows and columns. Using a parametric belief model, we only need to maintain the covariance matrix C^n, whose size is determined by the dimensionality of the parameter vector ϑ. We never need to compute the full matrix X C X^T, although we will have to compute a row of this matrix.
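Because the envelope construction behind Equation (3) is easy to get wrong, a hedged Python sketch of the computation of h(a, b) may be helpful; it is our own illustrative implementation of the procedure described above (function name and tie-breaking details are ours), not code from Frazier et al. (2009).

    import numpy as np
    from scipy.stats import norm

    def kg_h(a, b):
        """Compute h(a, b) = E[max_i a_i + b_i Z] - max_i a_i for Z ~ N(0, 1)
        by building the upper envelope of the affine functions a_i + b_i z."""
        a = np.asarray(a, dtype=float)
        b = np.asarray(b, dtype=float)
        order = np.lexsort((a, b))                 # sort by slope b, ties broken by intercept a
        a, b = a[order], b[order]
        keep = np.ones(len(b), dtype=bool)         # among equal slopes keep only the largest intercept
        keep[:-1] = ~np.isclose(b[:-1], b[1:])
        a, b = a[keep], b[keep]
        A, B = [a[0]], [b[0]]                      # lines currently on the upper envelope
        for ai, bi in zip(a[1:], b[1:]):
            while len(A) > 1:
                z_new = (A[-1] - ai) / (bi - B[-1])           # where the new line overtakes the last kept line
                z_prev = (A[-2] - A[-1]) / (B[-1] - B[-2])    # where the last kept line became maximal
                if z_new <= z_prev:
                    A.pop(); B.pop()               # the last line never achieves the maximum; drop it
                else:
                    break
            A.append(ai); B.append(bi)
        A, B = np.array(A), np.array(B)
        if len(A) == 1:
            return 0.0                             # a single surviving line carries no value of information
        f = lambda z: norm.pdf(z) + z * norm.cdf(z)
        c = -np.abs((A[:-1] - A[1:]) / (B[1:] - B[:-1]))
        return float(np.sum((B[1:] - B[:-1]) * f(c)))         # Equation (3)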

4 DC-RBF AS BELIEF MODEL

The DC-RBF scheme introduces a cover over the input space to define local regions, as described in Jamshidi and Powell (2013). The scheme stores only a statistical representation of the data in local regions and locally approximates the data with a low-order polynomial such as a linear model. The local model parameters are quickly updated using recursive least squares. A nonlinear weighting system associated with each local model determines the contribution of that local model to the overall model output. The combined effect of the local approximations and their associated weights produces a nonlinear function approximation tool. On the data sets that we have studied, this technique is efficient, both in computational time and in memory requirements, compared to classical nonparametric regression methods. The method is robust in the presence of homoscedastic and heteroscedastic noise, and it automatically determines the model order for a given dataset (one of the most challenging tasks in nonlinear function approximation). Unlike similar algorithms in the literature, the DC-RBF method has only one tunable parameter; in addition, it is asymptotically unbiased and provides an upper bound on the bias for finite samples (Jamshidi and Powell 2013).

Instead of the linear relationship µ = Xα, we assume there is a nonlinear relationship between µ and α,

    \mu = \frac{\sum_{i=1}^{N_c} \psi_i(c_i, W_i) X \alpha_i}{\sum_{i=1}^{N_c} \psi_i(c_i, W_i)},    (4)

where ψ_i is a kernel function with center c_i and width W_i. In our model, we employ the Gaussian kernel function

    \psi(c, W) = \exp\left( -\| x - c \|^2_W \right),    where    \| x \|^2_W = x^T W x.

N_c is the total number of kernels, which is automatically determined based on the shape of the function.

At time n, if alternative x^n = x is chosen, we observe ŷ^{n+1}_x. According to the DC-RBF algorithm, the data point x^n must be assigned to a cloud; the assignment is determined by the distance between the input data point x^n and the centers of the current clouds, D_i = ||x^n - c_i||. Let I = arg min_i D_i. If D_I > D_T (where D_T is a given, known threshold distance), we create a new cloud U_{N_c + 1} (where N_c denotes the number of existing clouds); otherwise we update cloud U_I with the new data point (Jamshidi and Powell 2013). We observe that this logic is independent of the observation ŷ^{n+1}_x, which simplifies (but also limits) our algorithm. The important simplification is that the set of clouds at time n+1, given the clouds at time n, is a deterministic function of x^n, and this greatly simplifies our adaptation of the knowledge gradient.

To update a particular cloud U_I, we need to update its local regression model and its weight. Locally, the iteration counter for cloud I is k_I, where \sum_i k_i = n, and x^n is the attribute vector (note that this is a row vector from the data matrix X). The recursive update for the response follows from the linear regression update

    \vartheta^{k_I + 1} = \vartheta^{k_I} + \frac{\hat{\varepsilon}^{k_I + 1}}{\gamma^{k_I}} C^{k_I} x^n,
    C^{k_I + 1} = C^{k_I} - \frac{1}{\gamma^{k_I}} \left( C^{k_I} x^n (x^n)^T C^{k_I} \right),

where we define \hat{\varepsilon}^{n+1} = \hat{y}^{n+1} - (\vartheta^n)^T x^n and \gamma^n = \lambda_{x^n} + (x^n)^T C^n x^n.
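A minimal Python sketch of the cloud-assignment and local recursive least squares step described above is given below; the class layout, the local prior covariance, and the omission of the kernel center and width bookkeeping (updated in the equations that follow) are our own simplifications, not the authors' implementation.

    import numpy as np

    class Cloud:
        """Local linear model for one Dirichlet cloud."""
        def __init__(self, center, dim):
            self.center = np.asarray(center, dtype=float)   # kernel center c_i
            self.k = 0                                      # local iteration counter k_I
            self.theta = np.zeros(dim)                      # local coefficients vartheta
            self.C = 100.0 * np.eye(dim)                    # local coefficient covariance (our prior choice)

    def dcrbf_update(clouds, x, x_row, y_hat, lam, D_T):
        """Assign the point x to the nearest cloud (or open a new one if the nearest
        center is farther than D_T), then run one recursive least squares step of
        the local linear model on (x_row, y_hat)."""
        x = np.asarray(x, dtype=float)
        x_row = np.asarray(x_row, dtype=float)
        if clouds:
            dists = [np.linalg.norm(x - c.center) for c in clouds]
            i_star = int(np.argmin(dists))                  # the index I in the text
        if not clouds or dists[i_star] > D_T:
            clouds.append(Cloud(x, len(x_row)))             # create cloud U_{N_c + 1}
            i_star = len(clouds) - 1
        cl = clouds[i_star]
        gamma = lam + x_row @ cl.C @ x_row                  # gamma
        eps = y_hat - cl.theta @ x_row                      # prediction error
        cl.theta = cl.theta + (eps / gamma) * (cl.C @ x_row)
        cl.C = cl.C - np.outer(cl.C @ x_row, x_row @ cl.C) / gamma
        cl.k += 1
        return i_star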

The center of the weighting kernel is updated via

    c^{k_I + 1} = c^{k_I} + \frac{x^n - c^{k_I}}{k_I}.

The width of the weighting kernel is then updated via the Welford formula, given in Knuth (1997). Note that W is computed for each dimension separately:

    Q^{k_I} = Q^{k_I - 1} + \frac{x^n - Q^{k_I - 1}}{k_I},
    S^{k_I} = S^{k_I - 1} + (x^n - Q^{k_I - 1})(x^n - Q^{k_I}).

The k_I-th estimate of the variance is

    W^2 = \frac{1}{k_I - 1} S^{k_I}.

5 KNOWLEDGE GRADIENT WITH DC-RBF BELIEF MODEL

It is apparent that Equation (4) is a weighted sum of multiple linear estimates. From the DC-RBF updating equations above, each data point is used to update the local estimate of exactly one cloud, implying that the coefficient vectors α_i, i = 1, ..., N_c, are independent. Suppose we impose normal beliefs on each α_i such that α_i ∼ N(ϑ_i, C_i); then µ is a linear combination of independent normal random variables, so it is also normally distributed, with

    \mu \sim N\left( \sum_{i=1}^{N_c} T_i \vartheta_i, \; \sum_{i=1}^{N_c} T_i C_i T_i^T \right),    where    T_i = \frac{\psi_i X}{\sum_{i=1}^{N_c} \psi_i}.

Substituting this new expression for the belief state back into h(a, b) = E[\max_i a_i + b_i Z] - \max_i a_i, we take

    a = \sum_{i=1}^{N_c} T_i^n \vartheta_i^n,
    b = \frac{\left( \sum_{i=1}^{N_c} T_i^n C_i^n (T_i^n)^T \right) e_x}{\sqrt{\lambda_x + \left( \sum_{i=1}^{N_c} T_i^n C_i^n (T_i^n)^T \right)_{xx}}}.

We can then compute the KG factor via the correlated belief method described in Equation (3).

6 EMPIRICAL RESULTS

Here we demonstrate the performance of the algorithm on a variety of synthesized data sets. The data sets have distinct features in terms of the complexity of the response function as well as the noise. Note that the algorithm provides a functional representation of the underlying belief function.

6.1 Quadratic Functions

Here we demonstrate KG-DC-RBF maximizing a quadratic function on the interval [1, 10]. We discretize the interval with step 0.1, producing a domain with 91 alternatives. We set the sampling variance to λ = 1 and the distance threshold for DC-RBF to D_T = 2. We initialize with a constant prior, with θ^0 = max_x(µ_x) + 3λ and Σ^0 = 100 I, where I is the identity matrix. Figure 1 illustrates the estimate of the function produced by KG-DC-RBF after 50 iterations. We observe that the proposed algorithm performs well in finding the maximum of the given function. There are 5 components in the DC-RBF model.
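As a rough illustration of how such an experiment might be wired together, the following sketch reuses the hypothetical update_correlated_beliefs and kg_h functions from the earlier sketches; the quadratic truth, the random sampling, and the use of the lookup-table correlated belief of Section 3 (rather than the DC-RBF belief, whose mean and covariance would be assembled from the cloud quantities of Section 5) are all our own simplifications.

    import numpy as np

    X = np.arange(1.0, 10.05, 0.1)                   # 91 alternatives on [1, 10]
    mu = -(X - 5.5) ** 2 + 90.0                      # an arbitrary quadratic truth (our choice)
    lam = 1.0                                        # sampling variance

    theta = np.full(len(X), mu.max() + 3 * lam)      # constant prior mean theta^0 = max mu + 3*lambda
    Sigma = 100.0 * np.eye(len(X))                   # prior covariance Sigma^0 = 100 I

    for n in range(50):
        kg = np.array([kg_h(theta, Sigma[:, x] / np.sqrt(lam + Sigma[x, x]))
                       for x in range(len(X))])      # KG factor of every alternative
        x_n = int(np.argmax(kg))                     # measure the alternative with the largest KG
        y_hat = mu[x_n] + np.sqrt(lam) * np.random.randn()
        theta, Sigma = update_correlated_beliefs(theta, Sigma, x_n, y_hat, lam)

    print("estimated best alternative:", X[int(np.argmax(theta))])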

Figure 1: Estimate of a quadratic function with KG-DC-RBF, after 50 iterations. This estimate aims at finding the maximum of the underlying function from a set of noisy observations; it does not aim at approximating the original function as a whole.

6.2 Newsvendor Problems

In the newsvendor problem, a newsvendor is trying to determine how many units of an item x to stock. The stocking cost and selling price of the item are c and p, respectively. One can assume an arbitrary distribution on the demand D. The expected profit is given by

    f(x) = E[p \min(x, D)] - c x.

With known prices and an uncertain demand, we want the optimal stocking quantity that maximizes profit. This problem represents a nontrivial challenge for many online estimation methods. First, it is highly heteroscedastic, with much higher measurement noise for larger values of x (especially x ≥ E[D]). Second, if p - c is small relative to c, the function is highly nonsymmetric, which causes problems for simple regressors such as quadratic functions. Higher-order approximations suffer from overfitting and generally lose the essential concavity of the function.

Figure 2 shows an instance of this problem, where we use p = 4, c = 3.9, and D sampled from N(40, 1). The small variance of the demand distribution causes the function to have a sharp drop, whereas a larger variance yields a smoother decline. In this particular example, we set the measurement noise to λ = 1, along with the prior beliefs θ^0 = max_x(µ_x) + 3λ and Σ^0 = 100 I. In this example we see that KG-DC-RBF quickly converges to the optimal alternative. This is otherwise very difficult with a parametric belief model such as a quadratic function.

Figure 2: Estimate of the optimal order quantity for a newsvendor problem using KG-DC-RBF.
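Simulating a noisy newsvendor observation is straightforward; the sketch below (our own illustrative code, using the prices and demand distribution of the example above) shows how the observations fed to the learning policy might be generated.

    import numpy as np

    def newsvendor_profit(x, p=4.0, c=3.9, demand_mean=40.0, demand_std=1.0, rng=np.random):
        """One noisy observation of the newsvendor profit p*min(x, D) - c*x,
        with normally distributed demand D."""
        D = rng.normal(demand_mean, demand_std)
        return p * min(x, D) - c * x

    # the objective f(x) = E[p*min(x, D)] - c*x can be estimated by averaging observations
    print(np.mean([newsvendor_profit(40.0) for _ in range(100000)]))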

6.3 Performance on One-dimensional Test Functions

In this section we compare the performance of KG-DC-RBF with two other methods of the same class. The first is pure exploration (EXP), where each alternative is selected with probability 1/M at every time step. The second is the knowledge gradient with non-parametric beliefs (KG-NP) presented in Barut and Powell (2013), which uses aggregates over a set of kernel functions to estimate the truth. We tested these policies on two types of Gaussian processes. All GP functions are defined on the mesh points x = 1, 2, ..., 100. For all the policies, we select the prior mean to be θ^0 = max_i µ_i + 3λ and the prior covariance to be Σ^0 = 100 I. For KG-NP, we use the Epanechnikov kernel.

6.3.1 Gaussian Process with Homoscedastic Covariance Functions

In this part we test the KG-DC-RBF policy on Gaussian processes with the covariance function

    \mathrm{Cov}(i, j) = \sigma^2 \exp\left( -\frac{(i - j)^2}{((M - 1)\rho)^2} \right),

which produces a stationary process with variance σ^2 and length scale ρ. Higher values of ρ produce fewer peaks, resulting in a smoother function, while the variance σ^2 scales the function vertically. Figure 3 illustrates randomly generated Gaussian processes with different values of ρ.

Figure 3: Stationary Gaussian processes with different values of ρ.

For our test functions, we use σ^2 = 0.5, choose ρ ∈ {0.05, 0.1}, and set D_T = 5. For both values of ρ, we generate 500 test functions on which to evaluate each policy. We use the opportunity cost as the performance benchmark,

    OC(n) = \max_i \mu_i - \mu_{i^*},    with    i^* = \arg\max_x \theta^n_x.

This measures the difference between the maximum of the truth and the true value of the best alternative found by a given algorithm. We average the opportunity cost over the 500 test functions for each value of ρ.
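A hedged sketch of how a stationary test function and the opportunity cost might be generated is shown below; the sampling mechanics and the jitter term are our own guesses at the construction described above.

    import numpy as np

    def sample_stationary_gp(M=100, sigma2=0.5, rho=0.1, rng=np.random):
        """Draw one test function mu on the mesh i = 1, ..., M from a GP with
        Cov(i, j) = sigma2 * exp(-(i - j)^2 / ((M - 1) * rho)^2)."""
        idx = np.arange(M)
        cov = sigma2 * np.exp(-np.subtract.outer(idx, idx) ** 2 / ((M - 1) * rho) ** 2)
        cov += 1e-10 * np.eye(M)                      # small jitter for numerical stability
        return rng.multivariate_normal(np.zeros(M), cov)

    def opportunity_cost(mu, theta):
        """OC(n) = max_i mu_i - mu_{i*}, where i* = argmax_x theta^n_x."""
        return np.max(mu) - mu[int(np.argmax(theta))]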

We also test the policies at two different sampling noise levels: a low-noise setting with λ = 0.01 and a high-noise setting with λ = (1/4)(max(µ) - min(µ)). The opportunity costs on a log scale for the different policies are given in Figure 4.

Figure 4: Comparison of policies on homoscedastic GPs, where d = max(µ) - min(µ). Panels: (a) ρ = 0.1, λ = 0.01; (b) ρ = 0.1, λ = d/4; (c) ρ = 0.05, λ = 0.01; (d) ρ = 0.05, λ = d/4.

The KG-DC-RBF method outperforms both KG-NP and EXP in the low-noise setting (λ = 0.01), because it can construct the underlying belief model from only a few observations. In the high-noise setting with ρ = 0.05, the functions have more peaks and valleys, and the advantage of KG-DC-RBF becomes less apparent with the fixed D_T = 5, although it retains slightly better performance. The high sampling variance makes some of the smaller peaks and valleys indistinguishable and requires sampling more alternatives, making it difficult to identify local maxima.

6.3.2 Gaussian Process with Heteroscedastic Covariance Functions

In this part we consider non-stationary covariance functions, in particular the Gibbs covariance function (Gibbs 1997). It has a structure similar to the exponential covariance function described above, but it is heteroscedastic. The Gibbs covariance function is given by

    \mathrm{Cov}(i, j) = \sigma^2 \sqrt{\frac{2\, l(i)\, l(j)}{l(i)^2 + l(j)^2}} \exp\left( -\frac{(i - j)^2}{l(i)^2 + l(j)^2} \right),

where l(i) is an arbitrary positive function of i. Here we choose a horizontally shifted periodic sine function given by

    l(i) = 10\left( 1 + \sin\left( 2\pi \left( \frac{i}{M} + u \right) \right) \right) + 1,

where u is a random number drawn from [0, 1].
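For completeness, a sketch of this Gibbs covariance with the sinusoidal length-scale function is given below; the i/M normalization inside the sine follows the formula as reconstructed above and is an assumption on our part.

    import numpy as np

    def gibbs_covariance(M=100, sigma2=0.5, rng=np.random):
        """Heteroscedastic Gibbs covariance on the mesh i = 1, ..., M with a
        randomly shifted sinusoidal length-scale function l(i)."""
        i = np.arange(1, M + 1, dtype=float)
        u = rng.uniform(0.0, 1.0)                                     # random horizontal shift
        l = 10.0 * (1.0 + np.sin(2.0 * np.pi * (i / M + u))) + 1.0    # assumed i/M normalization
        li, lj = np.meshgrid(l, l, indexing="ij")                     # l(i) and l(j) on the grid
        prefactor = np.sqrt(2.0 * li * lj / (li ** 2 + lj ** 2))
        sq_dist = np.subtract.outer(i, i) ** 2
        return sigma2 * prefactor * np.exp(-sq_dist / (li ** 2 + lj ** 2))

A heteroscedastic test function can then be drawn exactly as in the stationary sketch, with this covariance matrix in place of the stationary one.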

The opportunity costs on a log scale for the different policies are given in Figure 5.

Figure 5: Comparison of policies on heteroscedastic GPs, where d = max(µ) - min(µ). Panels: (a) λ = 0.01; (b) λ = d/4.

As in the stationary case, we observe that KG-DC-RBF outperforms both KG-NP and EXP in the low-noise setting. Note that in the heteroscedastic case the covariance between alternatives changes with the index of the alternative, which makes the estimation problem harder. As in the previous experiment, we observe that in the high-noise case the advantage of KG-DC-RBF becomes less apparent with the fixed D_T = 2, although it again retains slightly better performance. These experiments show that the KG-DC-RBF method is robust in the heteroscedastic case. In addition, KG-DC-RBF is faster and requires less storage.

7 CONCLUSION

In this paper we introduced an adaptation of the knowledge gradient technique for optimizing functions using the DC-RBF belief model. The knowledge gradient offers an efficient search strategy by maximizing the marginal value of a measurement. We note that the method focuses on finding the optimal solution rather than producing a good approximation of the function (even near the optimum).

The method offers some useful features. It overcomes the need for the extensive tuning required by parametric approximations. The local parametric approach automatically adapts to the order of the function. The belief model naturally accommodates heteroscedastic noise. We avoid the need to store the entire history of observations, although we do store history in the form of a smaller set of clouds. We also avoid the need to specify hyperparameters for Bayesian priors. The updating of the belief model is fully recursive, and the knowledge gradient is easy to compute by applying the knowledge gradient for correlated beliefs to the linear belief model of each cloud.

The proposed technique can be used for both on-line and off-line learning problems by using the simple bridging formula given in Ryzhov et al. (2012). The experimental work shown here is encouraging. Not surprisingly, we obtain better performance on functions with relatively low noise in the measurement process. Our experiments were all performed on relatively low-dimensional functions, and it remains to be seen how the method scales to higher-dimensional problems. In addition, we have found that fitting linear models to clouds with a small number of data points can be highly unreliable, especially in the presence of high levels of measurement noise. For this reason, we are investigating the use of a hierarchical blend of constant and linear functions with weights that adapt to the quality of the fit.

REFERENCES

Agrawal, R. 1995. The Continuum-Armed Bandit Problem. SIAM Journal on Control and Optimization 33.
Barut, E., and W. B. Powell. 2013. Optimal Learning for Sequential Sampling with Non-Parametric Beliefs. Journal of Global Optimization (to appear).
Eubank, R. L. 1988. Spline Smoothing and Nonparametric Regression. New York: Marcel Dekker.
Fan, J. 1992. Design-Adaptive Nonparametric Regression. Journal of the American Statistical Association 87 (420).
Frazier, P. I., W. B. Powell, and S. Dayanik. 2009. The Knowledge-Gradient Policy for Correlated Normal Beliefs. INFORMS Journal on Computing 21 (4).
Frazier, P. I., W. B. Powell, and S. Dayanik. 2008. A Knowledge-Gradient Policy for Sequential Information Collection. SIAM Journal on Control and Optimization 47 (5).
Fu, M. C. 2006. Gradient Estimation. In Handbooks in Operations Research and Management Science: Simulation, Chapter 19. Elsevier.
Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin. 2003. Bayesian Data Analysis. 2nd ed. Chapman and Hall/CRC.
Gibbs, M. 1997. Bayesian Gaussian Processes for Regression and Classification. Ph.D. thesis, University of Cambridge.
Ginebra, J., and M. K. Clayton. 1995. Response Surface Bandits. Journal of the Royal Statistical Society, Series B (Methodological) 57.
Gittins, J. C. 1979. Bandit Processes and Dynamic Allocation Indices. Journal of the Royal Statistical Society, Series B (Methodological) 41.
Gupta, S. S., and K. J. Miescke. 1996. Bayesian Look Ahead One-Stage Sampling Allocations for Selection of the Best Population. Journal of Statistical Planning and Inference 54.
Huang, D., T. T. Allen, W. I. Notz, and N. Zeng. 2006. Global Optimization of Stochastic Black-Box Systems via Sequential Kriging Meta-Models. Journal of Global Optimization 34.
Jamshidi, A. A., and M. J. Kirby. 2007. Towards a Black Box Algorithm for Nonlinear Function Approximation over High-Dimensional Domains. SIAM Journal on Scientific Computing 29 (3).
Jamshidi, A. A., and W. B. Powell. 2013. A Recursive Local Polynomial Approximation Method Using Dirichlet Clouds and Radial Basis Functions. SIAM Journal on Scientific Computing (submitted).
Jones, R. D., Y. Lee, C. W. Barnes, G. W. Flake, K. Lee, P. S. Lewis, and S. Qian. 1990. Function Approximation and Time Series Prediction with Neural Networks. In IJCNN International Joint Conference on Neural Networks, Volume 1.
Knuth, D. E. 1997. The Art of Computer Programming, Volume 2. 3rd ed. Addison-Wesley Professional.
Mes, M. R., W. B. Powell, and P. I. Frazier. 2011. Hierarchical Knowledge Gradient for Sequential Sampling. Journal of Machine Learning Research 12.
Montgomery, D. C., E. A. Peck, and G. G. Vining. 2001. Introduction to Linear Regression Analysis. 3rd ed. Wiley-Interscience.

Müller, H. G. 1988. Weighted Local Regression and Kernel Methods for Nonparametric Curve Fitting. Journal of the American Statistical Association 82.
Negoescu, D. M., P. I. Frazier, and W. B. Powell. 2011. The Knowledge-Gradient Algorithm for Sequencing Experiments in Drug Discovery. INFORMS Journal on Computing 23 (3).
Nelson, B. L., J. Swann, D. Goldsman, and W. Song. 2001. Simple Procedures for Selecting the Best Simulated System when the Number of Alternatives is Large. Operations Research 49.
Powell, W. B., and I. O. Ryzhov. 2012. Optimal Learning. John Wiley & Sons.
Ryzhov, I. O., W. B. Powell, and P. I. Frazier. 2012. The Knowledge Gradient for a General Class of Online Learning Problems. Operations Research 60.
Robbins, H., and S. Monro. 1951. A Stochastic Approximation Method. The Annals of Mathematical Statistics 22.
Spall, J. C. 2003. Introduction to Stochastic Search and Optimization. John Wiley & Sons.
Sutton, R. S., and A. G. Barto. 1998. Introduction to Reinforcement Learning. MIT Press.
Villemonteix, J., E. Vazquez, and E. Walter. 2009. An Informational Approach to the Global Optimization of Expensive-to-Evaluate Functions. Journal of Global Optimization 44.

AUTHOR BIOGRAPHIES

BOLONG (HARVEY) CHENG is a Ph.D. student in Electrical Engineering at Princeton University. He holds a B.Sc. degree in Electrical Engineering and does research in stochastic optimization in energy. His email address is bcheng@princeton.edu.

ARTA A. JAMSHIDI is a Research Associate in the Department of Operations Research and Financial Engineering at Princeton University. He is an expert in function approximation. He earned his Ph.D. in Mathematics in 2008, developing novel nonlinear signal processing tools that serve various applications. His research interests include machine learning, signal and image processing, optimization, and mathematical modeling. His email address is arta@princeton.edu.

WARREN B. POWELL is a Professor in the Department of Operations Research and Financial Engineering at Princeton University, and director of CASTLE Labs, which focuses on computational stochastic optimization and learning. His email address is powell@princeton.edu.


More information

Proceedings of the 2016 Winter Simulation Conference T. M. K. Roeder, P. I. Frazier, R. Szechtman, E. Zhou, T. Huschka, and S. E. Chick, eds.

Proceedings of the 2016 Winter Simulation Conference T. M. K. Roeder, P. I. Frazier, R. Szechtman, E. Zhou, T. Huschka, and S. E. Chick, eds. Proceedings of the 2016 Winter Simulation Conference T. M. K. Roeder, P. I. Frazier, R. Szechtman, E. Zhou, T. Huschka, and S. E. Chick, eds. SEQUENTIAL SAMPLING FOR BAYESIAN ROBUST RANKING AND SELECTION

More information

Introduction to Gaussian Processes

Introduction to Gaussian Processes Introduction to Gaussian Processes Iain Murray murray@cs.toronto.edu CSC255, Introduction to Machine Learning, Fall 28 Dept. Computer Science, University of Toronto The problem Learn scalar function of

More information

Mustafa H. Tongarlak Bruce E. Ankenman Barry L. Nelson

Mustafa H. Tongarlak Bruce E. Ankenman Barry L. Nelson Proceedings of the 0 Winter Simulation Conference S. Jain, R. R. Creasey, J. Himmelspach, K. P. White, and M. Fu, eds. RELATIVE ERROR STOCHASTIC KRIGING Mustafa H. Tongarlak Bruce E. Ankenman Barry L.

More information

Gaussian Process Regression: Active Data Selection and Test Point. Rejection. Sambu Seo Marko Wallat Thore Graepel Klaus Obermayer

Gaussian Process Regression: Active Data Selection and Test Point. Rejection. Sambu Seo Marko Wallat Thore Graepel Klaus Obermayer Gaussian Process Regression: Active Data Selection and Test Point Rejection Sambu Seo Marko Wallat Thore Graepel Klaus Obermayer Department of Computer Science, Technical University of Berlin Franklinstr.8,

More information

Lecture 5: GPs and Streaming regression

Lecture 5: GPs and Streaming regression Lecture 5: GPs and Streaming regression Gaussian Processes Information gain Confidence intervals COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 1 Recall: Non-parametric regression Input space X

More information

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc.

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010

More information

ICML Scalable Bayesian Inference on Point processes. with Gaussian Processes. Yves-Laurent Kom Samo & Stephen Roberts

ICML Scalable Bayesian Inference on Point processes. with Gaussian Processes. Yves-Laurent Kom Samo & Stephen Roberts ICML 2015 Scalable Nonparametric Bayesian Inference on Point Processes with Gaussian Processes Machine Learning Research Group and Oxford-Man Institute University of Oxford July 8, 2015 Point Processes

More information

TDT4173 Machine Learning

TDT4173 Machine Learning TDT4173 Machine Learning Lecture 3 Bagging & Boosting + SVMs Norwegian University of Science and Technology Helge Langseth IT-VEST 310 helgel@idi.ntnu.no 1 TDT4173 Machine Learning Outline 1 Ensemble-methods

More information

DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION. Alexandre Iline, Harri Valpola and Erkki Oja

DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION. Alexandre Iline, Harri Valpola and Erkki Oja DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION Alexandre Iline, Harri Valpola and Erkki Oja Laboratory of Computer and Information Science Helsinki University of Technology P.O.Box

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

The Expected Opportunity Cost and Selecting the Optimal Subset

The Expected Opportunity Cost and Selecting the Optimal Subset Applied Mathematical Sciences, Vol. 9, 2015, no. 131, 6507-6519 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2015.58561 The Expected Opportunity Cost and Selecting the Optimal Subset Mohammad

More information

Linear Models for Regression

Linear Models for Regression Linear Models for Regression Machine Learning Torsten Möller Möller/Mori 1 Reading Chapter 3 of Pattern Recognition and Machine Learning by Bishop Chapter 3+5+6+7 of The Elements of Statistical Learning

More information

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression

More information

PILCO: A Model-Based and Data-Efficient Approach to Policy Search

PILCO: A Model-Based and Data-Efficient Approach to Policy Search PILCO: A Model-Based and Data-Efficient Approach to Policy Search (M.P. Deisenroth and C.E. Rasmussen) CSC2541 November 4, 2016 PILCO Graphical Model PILCO Probabilistic Inference for Learning COntrol

More information

ECE521: Inference Algorithms and Machine Learning University of Toronto. Assignment 1: k-nn and Linear Regression

ECE521: Inference Algorithms and Machine Learning University of Toronto. Assignment 1: k-nn and Linear Regression ECE521: Inference Algorithms and Machine Learning University of Toronto Assignment 1: k-nn and Linear Regression TA: Use Piazza for Q&A Due date: Feb 7 midnight, 2017 Electronic submission to: ece521ta@gmailcom

More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information

Efficient Bayesian Multivariate Surface Regression

Efficient Bayesian Multivariate Surface Regression Efficient Bayesian Multivariate Surface Regression Feng Li (joint with Mattias Villani) Department of Statistics, Stockholm University October, 211 Outline of the talk 1 Flexible regression models 2 The

More information