arxiv: v1 [stat.ml] 13 Mar 2017

Bayesian Optimization with Gradients

Jian Wu, Matthias Poloczek, Andrew Gordon Wilson, and Peter I. Frazier
School of Operations Research and Information Engineering, Cornell University

Abstract

In recent years, Bayesian optimization has proven successful for global optimization of expensive-to-evaluate multimodal objective functions. However, unlike most optimization methods, Bayesian optimization typically does not use derivative information. In this paper we show how Bayesian optimization can exploit derivative information to decrease the number of objective function evaluations required for good performance. In particular, we develop a novel Bayesian optimization algorithm, the derivative-enabled knowledge-gradient (dKG), for which we show one-step Bayes-optimality, asymptotic consistency, and greater one-step value of information than is possible in the derivative-free setting. Our procedure accommodates noisy and incomplete derivative information, and comes in both sequential and batch forms. We show that dKG provides state-of-the-art performance compared to a wide range of optimization procedures with and without gradients, on benchmarks including logistic regression, kernel learning, and k-nearest neighbors.

1 Introduction

Bayesian optimization [Brochu et al., 2010, Kleijnen, 2014, Jones et al., 1998] is able to find global optima with a remarkably small number of potentially noisy objective function evaluations. Bayesian optimization has thus been particularly successful for automatic hyperparameter tuning of machine learning algorithms [Snoek et al., 2012, Swersky et al., 2013, Gelbart et al., 2014, Gardner et al., 2014], where objectives can be extremely expensive to evaluate, noisy, and multimodal.

Bayesian optimization supposes that the objective function (e.g., the predictive performance with respect to some hyperparameters) is drawn from a prior distribution over functions, typically a Gaussian process (GP), maintaining a posterior as we observe the objective in new places.
Given this distribution, acquisition functions such as expected improvement [Jones et al., 1998, Huang et al., 2006, Picheny et al., 2013], upper confidence bound [Srinivas et al., 2010], or the knowledge gradient [Scott et al., 2011, Wu and Frazier, 2016] determine a balance between exploration and exploitation to decide where to query the objective next. By choosing points with the largest acquisition function values, one seeks to identify a global optimum using as few objective function evaluations as possible.

Bayesian optimization procedures do not generally leverage derivative information, beyond a few exceptions described in Section 2 on related work. By contrast, other types of continuous optimization methods [Snyman, 2005] use gradient information extensively. The broader use of gradients for optimization suggests that gradients should also be quite useful in Bayesian optimization:

(1) Gradients inform us about the objective's relative value as a function of location, which is well-aligned with optimization.

(2) In d-dimensional problems, gradients provide d distinct pieces of information about the objective's relative value in each direction, constituting d + 1 values per query together with the objective value itself. In contrast, observing the objective value alone provides only one value per query. This difference is particularly significant for high-dimensional problems.

(3) Derivative information is available in many applications at little additional cost. Recent work [e.g., Maclaurin et al., 2015] makes gradient information available for hyperparameter tuning. Moreover, in the optimization of engineering systems modeled by partial differential equations, which pre-dates most hyperparameter tuning applications [Forrester et al., 2008], adjoint methods provide gradients cheaply [Plessix, 2006, Jameson]. And even when derivative information is not readily available, we can compute approximate derivatives in parallel through finite differences.
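The finite-difference fallback mentioned in point (3) can be sketched as follows. This is a minimal illustration rather than anything taken from the paper's code: the function name `fd_gradient`, the step size `h`, and the test objective `quadratic` are all assumptions made for the example. The 2d objective evaluations are mutually independent, which is what would allow them to run in parallel; here they simply run in a loop.

```python
import numpy as np

def fd_gradient(f, x, h=1e-5):
    """Central-difference estimate of the gradient of f at x.

    Uses 2*d evaluations of f. They do not depend on each other, so a
    parallel implementation could dispatch them all at once.
    """
    x = np.asarray(x, dtype=float)
    grad = np.empty_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2.0 * h)
    return grad

def quadratic(x):
    # Hypothetical test objective; its exact gradient is
    # (2*x0 + x1, x0 + 2*x1).
    return x[0] ** 2 + x[0] * x[1] + x[1] ** 2

approx = fd_gradient(quadratic, [1.0, -2.0])
exact = np.array([0.0, -3.0])
```

Such estimates are themselves noisy when the objective is noisy, which is one reason a method that tolerates noisy derivative observations is useful.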
In this paper, we explore the what, when, and why of Bayesian optimization with derivative information. We also develop a Bayesian optimization algorithm that effectively leverages gradients in hyperparameter tuning to outperform the state of the art. This algorithm accommodates incomplete and noisy gradient observations, can be used in both the sequential and batch settings, and can automatically select the derivatives which will be most useful for the optimization problem. For this purpose, we develop a new acquisition function, called the derivative-enabled knowledge-gradient (dKG). We also provide a theoretical analysis of our algorithm: we show (1) that it is one-step Bayes-optimal when derivatives are available; (2) that the one-step value provided is greater than in the derivative-free setting; and (3) that its estimator of the global optimum is asymptotically consistent when used over a discretized feasible space.

In numerical experiments we compare dKG with state-of-the-art batch Bayesian optimization algorithms with and without derivative information, and with the gradient-based optimizer BFGS with full gradients. We have made code available at: q/tree/jianwu_9_cpp gradients.

We assume familiarity with Gaussian processes and Bayesian optimization, for which we recommend Rasmussen and Williams [2006] and Shahriari et al. as a review. In Sect. 2 we begin by describing related work. In Sect. 3 we describe our Bayesian optimization algorithm exploiting derivative information. In Sect. 4 we compare the performance of our algorithm with several competing methods on a collection of synthetic and real problems.

2 Related Work

Osborne et al. propose fully Bayesian optimization procedures that use derivative observations to improve the conditioning of the Gaussian process covariance matrix. However, samples taken near previously observed points use only the derivative information to update the covariance matrix. Lizotte [2008] incorporates derivatives into Bayesian optimization, modelling the derivatives of a GP as in Rasmussen and Williams [2006]. Lizotte [2008] shows that Bayesian optimization with the expected improvement (EI) acquisition function and complete gradient information at each sample can outperform BFGS.
Our approach has six key differences: (i) we allow for noisy and incomplete derivative information; (ii) we develop a novel acquisition function that outperforms EI with derivatives; (iii) we enable batch evaluations; (iv) we implement and compare batch Bayesian optimization with derivatives across several acquisition functions, on benchmarks and new applications such as kernel learning, logistic regression, and k-nearest neighbors, further revealing empirically where gradient information will be most valuable; (v) we provide a theoretical analysis of Bayesian optimization with derivatives; (vi) we develop a scalable implementation.

Recently, several batch Bayesian optimization algorithms have been proposed that in each iteration choose a set of points, rather than a single point, at which the function is evaluated. Within this area, our approach to handling batch observations is most closely related to the batch knowledge gradient (KG) of Wu and Frazier [2016]. Other work in this area includes Snoek et al. [2012], Wang et al. [2016], and Marmin et al. [2016], who extended the expected improvement acquisition criterion to the batch setting. Batch acquisition algorithms can also be developed from upper confidence bounds [Contal et al., 2013, Desautels et al., 2014, Kathuria et al., 2016] or entropy search [Shah and Ghahramani, 2015]. Another recently proposed method is Local Penalization (LP) [Gonzalez et al., 2016], which assumes that the function is Lipschitz continuous and tries to estimate the Lipschitz constant.

3 Knowledge Gradient with Derivatives

In Sect. 3.1 we discuss a general approach to incorporating derivative information into Gaussian processes for Bayesian optimization. In Sect. 3.2, we introduce a novel acquisition function, based on the knowledge gradient acquisition function, which utilizes derivative information. In Sect. 3.3, we show that this algorithm provides more value of information than in the derivative-free setting, is one-step Bayes-optimal, and is asymptotically consistent when used over a discretized feasible space. We then detail how to implement the algorithm efficiently in Sect. 3.4.

3.1 Derivative Information

Given an expensive-to-evaluate function $f$, our goal is to find $\operatorname{argmin}_{x \in A} f(x)$, where $A \subset \mathbb{R}^d$ is the domain of optimization. We place a Gaussian process prior over the function $f : A \to \mathbb{R}$, which is specified by its mean function $\mu(\cdot) : A \to \mathbb{R}$ and the kernel function $K(\cdot, \cdot) : A \times A \to \mathbb{R}_{\ge 0}$, where $\mathbb{R}_{\ge 0}$ denotes the nonnegative reals. We initially suppose that for each sample we observe the function value and all $d$ partial derivatives, and then later show how to relax this assumption. For $x \in A$ we denote the function value by $f(x)$ and the gradient by $\nabla f(x)$. We jointly model the function and its gradient via a multi-output Gaussian process with mean function $\tilde{\mu}$ and kernel function $\tilde{K}$ defined as follows:

$$\tilde{\mu}(x) = \left(\mu(x), \nabla\mu(x)\right)^T, \qquad \tilde{K}(x, x') = \begin{pmatrix} K(x, x') & J(x, x') \\ J(x', x)^T & H(x, x') \end{pmatrix} \tag{3.1}$$

where $J(x, x') = \left(\frac{\partial K(x, x')}{\partial x'_1}, \ldots, \frac{\partial K(x, x')}{\partial x'_d}\right)$ and $H(x, x')$ is the $d \times d$ Hessian of $K(x, x')$. Since the gradient is a linear operator, the gradient of a GP is also a GP (see also Sect. 9.4 in Rasmussen and Williams [2006]).

We are particularly interested in the ability of acquisition algorithms to leverage noisy observations of partial derivatives. Accordingly, we suppose that the observations of the function value and the gradient are subject to noise. That is, when evaluating $f$ at a point $x$, we observe the $(d+1)$-dimensional vector

$$y(x) \mid \left(f(x), \nabla f(x)\right) \sim \mathcal{N}\!\left(\left(f(x), \nabla f(x)\right)^T, \operatorname{diag}(\sigma^2(x))\right),$$

where $\sigma^2 : A \to \mathbb{R}^{d+1}_{\ge 0}$ gives the variance of the observational noise at each point for the function value and its $d$ partial derivatives. Then $\operatorname{diag}(\sigma^2(x))$ is the diagonal matrix that gives the variance for each observation, i.e. either of the function $f$ or of a partial derivative. If $\sigma^2$ is not known, we will estimate it from data. Note that the posterior distribution is again a GP, with mean function $\tilde{\mu}^{(n)}(\cdot)$ and kernel function $\tilde{K}^{(n)}(\cdot, \cdot)$. Their formulae are given in the supplementary material for completeness.

To relax the assumption of complete derivatives, we note that if some entries of $(f(x), \nabla f(x))$ are not provided, then the remaining values associated with $x$ still obey the multivariate normal distribution imposed by the GP. Thus, we may simply omit the entries of the mean vector corresponding to outputs of the GP that are not available. Accordingly, we omit the rows and columns of the covariance matrix that correspond to values that were not provided.

3.2 The dKG Acquisition Algorithm

We propose a novel Bayesian optimization algorithm to exploit available derivative information, based on the knowledge gradient approach [Frazier et al., 2009]. We refer to this algorithm as the derivative-enabled knowledge gradient (dKG). The algorithm proceeds iteratively: in each iteration dKG selects a batch of $q$ points in $A$ that has a maximum value of information (VOI).
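The joint model of Sect. 3.1 can be made concrete with a small sketch. It assumes a one-dimensional input, a squared exponential kernel $K(x, x') = \alpha \exp(-(x - x')^2 / (2\ell^2))$, a zero prior mean, and homoscedastic noise; the function names and parameter values are hypothetical and are not taken from the authors' C++ implementation. The blocks returned by `se_blocks` play the roles of $K$, $J$, and $H$ in Eq. (3.1), and dropping the derivative rows and columns (here via `use_gradients=False`) is exactly the omission described above for missing observations.

```python
import numpy as np

def se_blocks(x1, x2, alpha=1.0, ell=1.0):
    """Covariance blocks between (f, f') at x1 and (f, f') at x2 (1-D inputs)."""
    r = x1[:, None] - x2[None, :]
    k = alpha * np.exp(-0.5 * r**2 / ell**2)      # Cov(f(x1),  f(x2))
    k_fd = k * r / ell**2                         # Cov(f(x1),  f'(x2)) = dK/dx2
    k_df = -k * r / ell**2                        # Cov(f'(x1), f(x2))  = dK/dx1
    k_dd = k * (1.0 / ell**2 - r**2 / ell**4)     # Cov(f'(x1), f'(x2))
    return k, k_fd, k_df, k_dd

def joint_kernel(x1, x2, **kw):
    """Block matrix in the layout of Eq. (3.1), ordered [values; derivatives]."""
    k, k_fd, k_df, k_dd = se_blocks(x1, x2, **kw)
    return np.block([[k, k_fd], [k_df, k_dd]])

def posterior_var_f(x_test, x_obs, noise_var, use_gradients, **kw):
    """Posterior variance of f at x_test, given noisy data at x_obs."""
    k_tt = se_blocks(x_test, x_test, **kw)[0]
    k, k_fd, _, _ = se_blocks(x_test, x_obs, **kw)
    if use_gradients:
        k_to = np.hstack([k, k_fd])               # cross-covariance with (f, f')
        k_oo = joint_kernel(x_obs, x_obs, **kw)
    else:
        k_to = k                                  # derivative rows/columns omitted
        k_oo = se_blocks(x_obs, x_obs, **kw)[0]
    k_oo = k_oo + noise_var * np.eye(k_oo.shape[0])
    return np.diag(k_tt - k_to @ np.linalg.solve(k_oo, k_to.T))

x_obs = np.array([-1.0, 0.5, 2.0])
x_test = np.linspace(-2.0, 3.0, 11)
var_f_only = posterior_var_f(x_test, x_obs, 1e-4, use_gradients=False)
var_with_grad = posterior_var_f(x_test, x_obs, 1e-4, use_gradients=True)
```

Because the gradient observations only add information, `var_with_grad` is never larger than `var_f_only` at any test point, which is the effect illustrated in the top row of Fig. 1.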
Suppose that we have observed $n$ points and let $\tilde{\mu}^{(n)}(x)$ for each $x \in A$ be the $(d+1)$-dimensional vector that gives the posterior mean for $f(x)$ and its $d$ partial derivatives at $x$. Sect. 3.1 discusses how to remove the assumption that all $d + 1$ values are provided. The expected value of $f(x)$ under the posterior distribution is given by $e_1^T \tilde{\mu}^{(n)}(x)$, where $e_1$ is the $(d+1)$-dimensional vector whose first entry is one and whose other entries are zero. If we were to make an irrevocable (risk-neutral) decision now, we would pick $\operatorname{argmin}_{x \in A} e_1^T \tilde{\mu}^{(n)}(x)$ (for a minimization problem). Therefore, we define the dKG factor for a given set of $q$ candidate points $z^{(1:q)}$ as

$$\mathrm{dKG}(z^{(1:q)}, A) = \min_{x \in A} e_1^T \tilde{\mu}^{(n)}(x) - \mathbb{E}_n\!\left[\min_{x \in A} e_1^T \tilde{\mu}^{(n+q)}(x) \,\middle|\, y(z^{(1:q)})\right], \tag{3.2}$$

where $\mathbb{E}_n$ is the expectation taken with respect to the posterior distribution after $n$ evaluations, and $y(z^{(1:q)})$ are the observations of both the function values and partial derivatives at the points $z^{(1:q)}$. We subsequently refer to Eq. (3.2) as the inner optimization problem. Crucially, the dKG factor takes the posterior distribution over the derivatives at the points $z^{(1:q)}$ into account by conditioning on $y(z^{(1:q)})$, although Eq. (3.2) is formulated as the difference between posterior means of the function $f$. Then, from a conceptual point of view, we would choose to evaluate next the batch of points that maximizes the dKG factor,

$$\max_{z^{(1:q)} \subset A} \mathrm{dKG}(z^{(1:q)}, A). \tag{3.3}$$

We refer to Eq. (3.3) as the outer optimization problem.

In practice, including all $d$ partial derivatives can be prohibitive, since GP inference scales with all partial derivatives as $O(n^3 (d+1)^3)$. However, we may only want to include one directional derivative in each iteration [Ahmed et al.]. dKG can naturally tell which derivative to choose and how it affects the acquisition function.
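As a rough worked illustration of this scaling, take $n = 100$ evaluated points in $d = 8$ dimensions (numbers assumed purely for the example; they are not from the experiments). Observing the function value and all partial derivatives gives $n(d+1)$ scalar observations, whereas observing the function value and a single derivative gives $2n$:

$$n(d+1) = 900 \;\Rightarrow\; 900^3 \approx 7.3 \times 10^{8}, \qquad 2n = 200 \;\Rightarrow\; 200^3 = 8 \times 10^{6}, \qquad \frac{(d+1)^3}{2^3} = \frac{729}{8} \approx 91.$$

Under these assumptions, restricting each observation to one derivative reduces the cubic cost of inference by roughly two orders of magnitude.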
We define the dKG acquisition function by only conditioning on the function value and the $i$th derivative at $z^{(1:q)}$ for $1 \le i \le d$,

$$\mathrm{dKG}(z^{(1:q)}, i, A) = \min_{x \in A} e_1^T \tilde{\mu}^{(n)}(x) - \mathbb{E}_n\!\left[\min_{x \in A} e_1^T \tilde{\mu}^{(n+q)}(x) \,\middle|\, y_{\{1, i+1\}}(z^{(1:q)})\right], \tag{3.4}$$

where $y_1(x)$ is the observed function value at $x$, and $y_{2:(d+1)}(x)$ are the $d$ derivative observations at $x$ accordingly. Eq. (3.4) characterizes the value of information if we only observe the function value and its $i$th partial derivative at $z^{(1:q)}$. The full algorithm is as follows.

Algorithm 1 dKG with Relevant Derivative Detection
1: for $t = 1$ to $N$ do
2:   $(z^{(1:q)*}, i^*) = \operatorname{argmax}_{z^{(1:q)}, i} \mathrm{dKG}(z^{(1:q)}, i, A)$
3:   Observe $y_{\{1, i^*+1\}}(z^{(1:q)*})$ and update the posterior distribution of $(f(x), \nabla f(x))$.
4: end for
Return $x^* = \operatorname{argmin}_{x \in A} e_1^T \tilde{\mu}^{(Nq)}(x)$

Algorithm 1 requires solving $d$ continuous optimization problems, so this step scales linearly with $d$.

3.3 Theoretical Analysis

Here we present three theoretical results giving insight into the properties of dKG. We provide all proofs in the supplementary material. In this section, we analyze dKG with all available derivatives for simplicity. However, one can prove similar results for dKG with relevant derivative detection. The following proposition shows that the VOI obtained by dKG exceeds the VOI possible in the derivative-free setting.

Proposition 1. Given identical posteriors $\tilde{\mu}^{(n)}$,
$$\mathrm{dKG}(z^{(1:q)}, A) \ge \mathrm{KG}(z^{(1:q)}, A),$$
where KG is the batch knowledge gradient acquisition function without gradients proposed in Wu and Frazier [2016].

By construction, dKG is one-step Bayes-optimal, as stated in Theorem 1.

Theorem 1. If only one iteration is left and we can observe both function values and their partial derivatives, then dKG is Bayes-optimal among all feasible policies.

As a complement to one-step optimality, we show that dKG is asymptotically consistent when the feasible set $A$ is finite. Asymptotic consistency means that dKG will choose the correct solution when the number of iterations goes to infinity.

Theorem 2. The dKG algorithm is asymptotically consistent, i.e.
$$\lim_{N \to \infty} f(x^*(\mathrm{dKG}, N)) = \min_{x \in A} f(x)$$
almost surely, where $x^*(\mathrm{dKG}, N)$ is the point recommended by dKG after $N$ iterations.

3.4 An Efficient Approximation of dKG

Recall that the maximization of dKG is difficult, since each evaluation of the objective function $\mathrm{dKG}(z^{(1:q)}, i, A)$ requires an optimal solution to the optimization problem in Eq. (3.4), which is stated over the continuous space $A$. To make this problem tractable in practice, we propose a novel discretization that improves over Wu and Frazier [2016]. Then we can compute the dKG factor and its gradient over the discrete set, which allows us to optimize dKG efficiently via a gradient-based optimizer. We provide computational details in the supplementary material.

A Novel Discretization of A.
We discretize the set $A$ in the optimization problem stated in Eq. (3.4). For example, one can draw $M$ samples from the posterior over the global maximizer (please refer to the appendix for a description of this technique). This sample set, denoted by $A_n^M$, is then extended by the location of the minimum posterior mean $x_n^*$, where $x_n^* = \operatorname{argmin}_{x \in A} e_1^T \tilde{\mu}^{(n)}(x)$, and by the set of points $z^{(1:q)}$ whose value of information we wish to compute. Then the optimization problem in Eq. (3.4) can be restated as

$$\mathrm{dKG}(z^{(1:q)}, i, A_n) = \min_{x \in A_n} e_1^T \tilde{\mu}^{(n)}(x) - \mathbb{E}_n\!\left[\min_{x \in A_n} e_1^T \tilde{\mu}^{(n+q)}(x) \,\middle|\, y_{\{1, i+1\}}(z^{(1:q)})\right],$$

where $A_n = A_n^M \cup \{x_n^*\} \cup z^{(1:q)}$.

Bayesian Treatment of Hyperparameters. We adopt the full Bayesian treatment of hyperparameters. We draw $K$ samples of hyperparameters $\phi^{(j)}$ for $1 \le j \le K$ via slice sampling [Neal]. Letting $A_n^{(j)}$ be the discrete set under hyperparameters $\phi^{(j)}$, the integrated acquisition function is

$$\mathrm{dKG}_n(z^{(1:q)}, i) = \frac{1}{K} \sum_{j=1}^{K} \mathrm{dKG}(z^{(1:q)}, i, A_n^{(j)}). \tag{3.5}$$

3.5 An Illustration of Incorporating Derivative Information

We examine how observing derivative information affects the posterior distribution and the value of information analyses of the knowledge gradient and the expected improvement criteria. First, we formally define dEI (derivative-enabled EI) as follows:

$$\mathrm{dEI}(z^{(1:q)}) = \mathbb{E}_n\!\left[\left(\min y_1^{(1:n)} - \min_{x \in z^{(1:q)}} e_1^T \tilde{\mu}^{(n+q)}(x)\right)^{+}\right].$$

The two topmost plots of Fig. 1 depict the posterior surfaces of a function sampled from a one-dimensional Gaussian process without taking partial derivatives into account (left-hand side) and after incorporating observations of the full respective gradients at the sample locations (right-hand side). We see that the uncertainty is considerably reduced if derivative information is taken into account. The two plots in the second row illustrate how the acquisition criteria of the knowledge gradient and expected improvement are affected by including derivative information. Here we suppose a batch size of one.
Note that KG, EI, and even dEI pick essentially the same location for the next sample, whereas dKG prefers a different sample. The plots in the third and fourth row show the posterior surface after observing the next sample chosen by the respective acquisition criterion. We see that the posterior uncertainty is smaller away from the global optimum for the algorithms that utilize the gradient observations than for those that do not. Interestingly, we see that the knowledge gradient seems to benefit considerably more from derivative information than expected improvement (fourth row): dKG has sampled a point whose observation gives an accurate knowledge of the location of the optimum, while dEI is still forced to make a greedy sampling decision. We will investigate this observation in more detail in our experimental evaluation.

4 Experiments

We evaluate the performance of the proposed algorithm dKG with relevant derivative detection (Algorithm 1) on six standard synthetic benchmarks in Sect. 4.1. Moreover, we examine its ability to tune the hyperparameters for the weighted k-nearest neighbor algorithm (KNN) (see Sect. 4.2), logistic regression (Sect. 4.3), and a spectral mixture kernel (cp. Sect. 4.4). Note that in the former two applications not all hyperparameters are differentiable.

We compare its performance to state-of-the-art methods in Bayesian optimization: (i) the batch expected improvement method (EI) of Wang et al. [2016], which does not utilize derivative information, and our extension of this method that incorporates derivative information (dEI); (ii) the batch GP-UCB-PE method of Contal et al. [2013], which does not utilize derivative information, and our extension of this method that incorporates derivative information; (iii) the batch knowledge gradient algorithm without derivative information (KG) of Wu and Frazier [2016]. All of the above algorithms can be run even if not all partial derivatives are given. In benchmarks that provide the full gradient, we additionally compare to the gradient-based method L-BFGS-B provided in scipy.
We suppose that the objective function $f$ is drawn from a Gaussian process $GP(\mu, \Sigma)$, where $\mu$ is a constant mean function and $\Sigma$ is the squared exponential kernel. We sample $K = 100$ sets of hyperparameters by slice sampling. The parameter $M$ that determines the number of samples drawn from the posterior over the global maximizer is set to 10 (cp. Sect. 3.4).

Recall that the immediate regret is defined as the loss with respect to a global optimum. The plots for synthetic benchmark functions, shown in the supplementary material, report the immediate regret of the solution that each algorithm would pick as a function of the number of function evaluations. For the other experiments the plots depict the objective value of the solution instead of the immediate regret. The error bars give the mean value plus and minus one standard deviation. The number of replications varies and is stated in the description of the respective benchmark below.

We implemented our method in C++ with a Python interface. We have made code available at wujian16/q/tree/jianwu_9_cpp gradients.

4.1 Results on Synthetic Functions

We evaluate all methods on six test functions chosen from [Bingham]. In order to demonstrate the ability to benefit from noisy derivative information, we sample additive normally distributed noise with zero mean and variance $\sigma^2 = 0.25$ for both the objective function and its partial derivatives. Note that $\sigma$ is not known to the algorithms but has to be estimated from observations. Moreover, we investigate how the performance of the algorithms is affected if partial derivatives are not given for all parameters. We also experiment with two different batch sizes: we use a batch size $q = 4$ for the Branin, Rosenbrock, and Ackley functions; otherwise, we use a batch size $q = 8$. The experimental results are summarized in Fig. 5 of the supplementary material.

Functions with Full Gradient Information.
For the 2d Branin function on the domain $[-15, 15]^2$, the 5d Ackley function on $[-2, 2]^5$, and the 6d Hartmann function on $[0, 1]^6$, we assume that the full gradient is available. Looking at the results for the Branin function (cp. Fig. 5 in the supplementary material), dKG outperforms its competitors after 40 function evaluations and obtains the best solution overall (within the limit of function evaluations). BFGS makes faster progress than the Bayesian optimization methods during the first 20 evaluations, but subsequently stalls and fails to obtain a competitive solution. On the Ackley function dEI makes fast progress during the first 50 evaluations but also fails to make any subsequent progress. Conversely, dKG requires about 50 evaluations to improve on the performance of dEI; dKG exhibits the best overall performance again. For the Hartmann function dKG clearly dominates its competitors over all function evaluations.

Functions with Incomplete Derivative Information. For the 3d Rosenbrock function on $[-2, 2]^3$ we only provide a noisy observation of the third partial derivative. Both EI and dEI get stuck early. dKG, on the other hand, finds a near-optimal solution after about 50 function evaluations; KG catches up after about 75 evaluations and has a comparable performance afterwards. The 4d Levy benchmark on $[-10, 10]^4$, where the fourth partial derivative is observable with noise, shows a different ordering of the algorithms: here EI has the best performance, beating even its formulation that utilizes derivative information. A possible explanation could be that the smoothness and regular shape of the function surface benefit this acquisition criterion. For the 8d Cosine mixture function on $[-1, 1]^8$ we provide two noisy partial derivatives. dKG and UCB with derivatives perform better than the EI-type criteria and achieve the best performances, with dKG beating UCB with derivatives slightly. Summing up, we see that dKG successfully exploits noisy derivative information and has the best overall performance.

[Figure 1. Panels: posterior without gradient observations; posterior with gradient observations; KG and EI without gradient, with the best point chosen by each; KG and EI with gradient, with the best point chosen by each; posterior after evaluating the point chosen by KG, by EI, by dKG, and by dEI.]

Figure 1: The topmost plots show the posterior surfaces of a function sampled from a one-dimensional Gaussian process with and without incorporating observations of the gradients. Note that the posterior variance is considerably smaller if the gradients are incorporated. The plots in the second row show the utility of sampling each point under the value of information criteria of KG and EI in both settings. If no derivatives are observed, both KG and EI will query a point with high potential gain (i.e. a small expected function value). On the other hand, when gradients are observed, dKG makes a considerably better sampling decision, whereas dEI samples essentially the same location as EI. The plots in the third and fourth row depict the posterior surface after the respective sample. Interestingly, KG benefits more from observing the gradients than EI (fourth row): dKG samples a point whose observation yields an accurate knowledge of the location of the optimum, while dEI still has considerable uncertainty around the optimum.

4.2 Weighted k-nearest Neighbor

Suppose a cab company wishes to predict the duration of trips for its vehicles and customers. Clearly, the duration depends not only on the endpoints of the trip, but also on the day and time. In this benchmark we tune a weighted k-nearest neighbor (KNN) metric to optimize predictions of these durations, based on historical data. A trip is described by the pick-up time $t$, the pick-up location $(p_1, p_2)$, and the drop-off point $(d_1, d_2)$. Then the estimate of the duration is obtained as a weighted average over all trips $D_{m,t}$ in our database that happened in the time interval $t \pm m$ minutes, where $m$ is a tunable hyperparameter:

$$\mathrm{Prediction}(t, p_1, p_2, d_1, d_2) = \frac{\sum_{i \in D_{m,t}} \mathrm{duration}_i \times \mathrm{weight}(i)}{\sum_{i \in D_{m,t}} \mathrm{weight}(i)}.$$

The weight of trip $i \in D_{m,t}$ in this prediction is given by

$$\mathrm{weight}(i) = \left(\frac{(t - t^i)^2}{l_1^2} + \frac{(p_1 - p_1^i)^2}{l_2^2} + \frac{(p_2 - p_2^i)^2}{l_3^2} + \frac{(d_1 - d_1^i)^2}{l_4^2} + \frac{(d_2 - d_2^i)^2}{l_5^2}\right)^{-1},$$

where $(t^i, p_1^i, p_2^i, d_1^i, d_2^i)$ are the respective parameter values for trip $i$, and $(l_1, l_2, l_3, l_4, l_5)$ are tunable hyperparameters. Thus, we have 6 hyperparameters to tune: $(m, l_1, l_2, l_3, l_4, l_5)$. We choose $m$ in $[30, 200]$, $l_1^2$ in $[10^1, 10^8]$, and $l_2^2, l_3^2, l_4^2, l_5^2$ each in $[10^{-8}, \ldots]$.

We use the yellow cab NYC public data set from June 2016, sampling records from June 1st to June 25th as training data and 1000 trip records from June 26th to 30th as validation data. Our test criterion is the root mean squared error (RMSE), for which we compute the partial derivatives on the validation dataset with respect to the hyperparameters $(l_1, l_2, l_3, l_4, l_5)$, while the hyperparameter $m$ is not differentiable. The experimental results show that dKG eventually performs considerably better than all the other competing algorithms. For the KG and EI acquisition functions, exploiting derivative information provides an advantage. Fig. 2 summarizes the results for batch size $q = 8$.

Figure 2: We tune 6 hyperparameters in the weighted KNN with batch size 8, where the first 5 derivatives are available. We report the best function value for the KNN benchmark, averaged over 20 replications.

4.3 Logistic Regression

We tune logistic regression on the MNIST dataset [LeCun et al.]. The task is to classify handwritten digits from images. We train the algorithm on a training set of images with a given set of hyperparameters and evaluate it on a separate test set of images. We tune 4 hyperparameters: the l2 regularization parameter from 0 to 1, the learning rate from 0 to 1, the mini-batch size from 20 to 2000, and the number of training epochs from 5 to 50.
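Returning briefly to the weighted KNN predictor of Sect. 4.2, its two formulas can be sketched in a few lines. This is an illustrative sketch only: the function name `knn_predict`, the array layout (one row per trip, columns ordered as t, p1, p2, d1, d2), and the small constant `eps` that guards against a zero distance are assumptions of the example, not details given in the text.

```python
import numpy as np

def knn_predict(query, trips, durations, m, l_sq, eps=1e-12):
    """Weighted-average duration estimate for one query trip.

    query     : (5,)   features (t, p1, p2, d1, d2) of the trip to predict
    trips     : (N, 5) features of the historical trips, same column order
    durations : (N,)   observed durations of the historical trips
    m         : half-width of the pick-up time window, in the units of t
    l_sq      : (5,)   squared length scales (l1^2, ..., l5^2)
    eps       : assumed safeguard against division by zero (not in the paper)
    """
    in_window = np.abs(trips[:, 0] - query[0]) <= m      # the set D_{m,t}
    scaled_sq_dist = ((trips[in_window] - query) ** 2 / l_sq).sum(axis=1)
    weights = 1.0 / (scaled_sq_dist + eps)               # weight(i)
    return np.sum(weights * durations[in_window]) / np.sum(weights)

# Hypothetical toy data: two trips equally far from the query in time,
# and a third trip that falls outside the window and is ignored.
trips = np.array([[0.0, 0, 0, 0, 0], [2.0, 0, 0, 0, 0], [500.0, 0, 0, 0, 0]])
durations = np.array([10.0, 30.0, 1000.0])
query = np.array([1.0, 0, 0, 0, 0])
estimate = knn_predict(query, trips, durations, m=60.0, l_sq=np.ones(5))
```

The prediction is smooth in the length scales $l_1, \ldots, l_5$ but changes discontinuously when $m$ admits or excludes a trip, which is consistent with the text treating $m$ as the one non-differentiable hyperparameter.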
The first two derivatives (with respect to the l2 regularization parameter and the learning rate) are available via the method of Maclaurin et al. [2015]. We report the mean and standard deviation of the test loss for 20 independent runs. dKG and dEI outperform the other approaches, which suggests that derivative information is helpful in this application. Logistic regression can be seen as a neural network with no hidden layers. Thus, this example indicates that our algorithm can be useful for tuning deep neural networks if the gradient with respect to the hyperparameters can be computed efficiently [Maclaurin et al., 2015, Luketina et al., 2015, Fu et al.].

Figure 3: We tune logistic regression (4 hyperparameters) with batch size 8, where the first 2 derivatives are available. We report the negative entropy loss on the test set, averaged over 20 replications.

4.4 Kernel Learning

We examine the performance of the optimization algorithms for a complex kernel learning task. Although we have access to an analytic closed-form (marginal likelihood) objective, this objective is (i) expensive to evaluate and (ii) highly multimodal, and (iii) derivative information is available. Thus learning flexible kernel functions is a perfect candidate for our approach.

Spectral mixture kernels [Wilson and Adams, 2013] can be used for flexible kernel learning to enable long-term extrapolation. These kernels are obtained by modeling a spectral density by a mixture of Gaussians. While any stationary kernel can be described by a spectral mixture kernel with a particular setting of its hyperparameters, initializing and learning these parameters can be difficult, due to a highly multimodal marginal likelihood objective.

In this experiment, the task is to train a 3-component spectral mixture kernel on an airline data set used by Wilson and Adams [2013]. We have to determine the mixture weights, means, and variances for each of the three Gaussians. We run the algorithms with batch size $q = 8$ on this highly multimodal function. Their performance is summarized in Fig. 4. On this application, BFGS tends either to perform reasonably well or to become trapped in a bad local optimum, depending highly on initialization and human intervention. dKG, on the other hand, can more consistently find a good solution. Here dKG finds the best solution within the step limit. Overall, we observe that gradient information is highly valuable in performing this kernel learning task.
Figure 4: Top: The average performance for the spectral mixture kernel benchmark over 20 replications. Bottom: The test performance of one final recommendation of hyperparameters found by dKG.

5 Discussion

Bayesian optimization is primarily applied to low-dimensional problems where we wish to find a good solution with a very small number of objective function evaluations. We considered several such benchmarks, as well as logistic regression, kernel learning, and k-nearest neighbor applications. We have shown that in this context derivative information can be extremely useful: we can greatly decrease the number of objective function evaluations, especially when building upon the knowledge gradient acquisition function, even when derivative information is noisy and only available for some variables.

Bayesian optimization is increasingly being used to automate parameter tuning in machine learning, where objective functions can be extremely expensive to evaluate. For example, the parameters could even represent the hyperparameters of a deep neural network. We expect derivative information with Bayesian optimization to help enable such promising applications, moving us towards fully automatic and principled approaches to statistical machine learning.

In the future, one could combine derivative information with flexible deep projections [Wilson et al., 2016] and recent advances in scalable Gaussian processes for $O(n)$ training and $O(1)$ test time predictions [Wilson and Nickisch, 2015, Wilson et al.]. These steps would help make Bayesian optimization applicable to a much wider range of problems, wherever standard gradient-based optimizers are used, even when we have analytic objective functions that are not expensive to evaluate, while retaining faster convergence and robustness to multimodality.

Acknowledgments

Wilson was partially supported by NSF IIS. Frazier, Poloczek, and Wu were partially supported by NSF CAREER CMMI, NSF CMMI, NSF IIS, AFOSR FA, AFOSR FA, and AFOSR FA.

A The Posterior Distribution of the Multivariate GP

Suppose that we have sampled $f$ at $n$ points $X := \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$ so far and observed $y^{(1:n)}$, where each observation consists of the function value and the gradient at $x^{(i)}$. Then the posterior distribution is a multivariate Gaussian process with mean function $\tilde{\mu}^{(n)}(\cdot)$ and kernel function $\tilde{K}^{(n)}(\cdot, \cdot)$, where

$$\begin{aligned}
\tilde{\mu}^{(n)}(x) &= \tilde{\mu}(x) + \tilde{K}(x, X)\left(\tilde{K}(X, X) + \operatorname{diag}\{\sigma^2(x^{(1)}), \ldots, \sigma^2(x^{(n)})\}\right)^{-1}\left(y^{(1:n)} - \tilde{\mu}(X)\right), \\
\tilde{K}^{(n)}(x_1, x_2) &= \tilde{K}(x_1, x_2) - \tilde{K}(x_1, X)\left(\tilde{K}(X, X) + \operatorname{diag}\{\sigma^2(x^{(1)}), \ldots, \sigma^2(x^{(n)})\}\right)^{-1}\tilde{K}(X, x_2).
\end{aligned} \tag{A.1}$$

The rows and columns in Eq. (A.1) corresponding to partial derivatives (or function values) that were not observed are to be omitted.

B Spectral Density Approximation of the Gaussian Process

In this paper, we use random features to approximate a Gaussian process, both to obtain a better discretization of the set $A$ used in the inner optimization problem of dKG (see Sect. 3.4) and to improve the scalability of kernel learning, thereby following ideas of Hernández-Lobato et al. [2014] and Lázaro-Gredilla et al. Denote by $s(w)$ the Fourier dual of a stationary kernel function and by $p(w) := s(w)/\alpha$ the associated normalized density, where $\alpha = \int s(w)\,dw$. We approximate the Gaussian process with a finite set of $m$ random features; specifically,

$$K(x_1, x_2) = \frac{2\alpha}{m}\, \mathbb{E}_{p(w, b)}\!\left[\cos(W x_1 + b)^T \cos(W x_2 + b)\right],$$

where $W$ is an $m \times d$ random matrix with $W_{ij} \sim p(w)$ and $b$ is an $m \times 1$ random vector with $b_i \sim U(0, 2\pi)$ [Hernández-Lobato et al., 2014, Sect. A]. Let $\Phi(x) = \sqrt{2\alpha/m}\, \cos(W x + b)$. We approximate the Gaussian process prior for $f$ via a Bayesian linear model $f(x) = \Phi(x)^T \theta$, where $\theta \sim \mathcal{N}(0, I)$. Conditioned on the collected data, the posterior of $\theta$ is multivariate normal with mean and covariance

$$m = (\Phi^T \Phi + \Sigma)^{-1} \Phi^T y, \qquad V = (\Phi^T \Phi + \Sigma)^{-1} \Sigma,$$

where $\Sigma = \operatorname{diag}(\sigma^2(x))$ denotes the variance for observations of function values and partial derivatives (see Sect. 3.1).
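The random-feature approximation above can be checked numerically with a short sketch. It assumes a squared exponential kernel with signal variance `alpha` and length scale `ell`, whose normalized spectral density $p(w)$ is a zero-mean Gaussian with standard deviation $1/\ell$ per coordinate; the function names, the seed, and the number of features are hypothetical choices for this example.

```python
import numpy as np

def sample_features(d, m, alpha=1.0, ell=1.0, seed=0):
    """Draw W and b once and return the feature map Phi(x) = sqrt(2*alpha/m) cos(Wx + b)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 1.0 / ell, size=(m, d))    # entries drawn from p(w) of the SE kernel
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)      # b_i ~ U(0, 2*pi)

    def phi(X):
        X = np.atleast_2d(X)
        return np.sqrt(2.0 * alpha / m) * np.cos(X @ W.T + b)

    return phi

def se_kernel(X1, X2, alpha=1.0, ell=1.0):
    """Exact squared exponential kernel, used here only as a reference."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return alpha * np.exp(-0.5 * sq / ell**2)

X = np.array([[0.0, 0.0], [0.5, -0.3], [1.5, 1.0]])
phi = sample_features(d=2, m=20000)
K_approx = phi(X) @ phi(X).T        # Phi(x1)^T Phi(x2) for all pairs
K_exact = se_kernel(X, X)
```

With this many features the two matrices agree closely. In practice $m$ trades accuracy against the $O(m^2 n)$ cost discussed in this appendix.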
To sample from the posterior of the global maxima, we first sample $m$ random features $\Phi^{(i)}(x)$ and their corresponding weights $\theta^{(i)}$, and then construct $f^{(i)}(x) = \Phi^{(i)}(x)^T \theta^{(i)}$. This is a sample from the approximate posterior of $f$ conditioned on the data, on which we locate global optima using a gradient-based optimizer (see also Sect. 2.1 in Hernández-Lobato et al. and Sect. 3.2 in Shah and Ghahramani [2015]). Recall that we sample the hyperparameters of the kernel via slice sampling regularly as more observations are obtained. To speed up kernel learning when the number of samples $n$ exceeds $m$, we apply the above approximation. The log-likelihood of a set of hyperparameters is, up to a positive scale factor and an additive constant,

$$
-\log\det\left(\Phi\Phi^T + \Sigma\right) - y^T\left(\Phi\Phi^T + \Sigma\right)^{-1} y.
$$

With this approximation, the computation time is $O(m^2 n)$ instead of $O(n^3)$.

C Proof of Proposition 1 and Theorem 1

Proof of Proposition 1. Recall that we start with the same posterior $\tilde{\mu}^{(n)}$; then

$$
\begin{aligned}
\text{d-KG}(z^{(1:q)}, A) &= \min_{x \in A} e_1^T \tilde{\mu}^{(n)}(x) - \mathbb{E}_n\left[\min_{x \in A} \mathbb{E}_n\left[y_1(x) \mid y(z^{(1:q)})\right]\right] \\
&= \min_{x \in A} e_1^T \tilde{\mu}^{(n)}(x) - \mathbb{E}_n\left[\mathbb{E}_n\left[\min_{x \in A} \mathbb{E}_n\left[y_1(x) \mid y(z^{(1:q)})\right] \,\Big|\, y_1(z^{(1:q)})\right]\right] \\
&\geq \min_{x \in A} e_1^T \tilde{\mu}^{(n)}(x) - \mathbb{E}_n\left[\min_{x \in A} \mathbb{E}_n\left[\mathbb{E}_n\left[y_1(x) \mid y(z^{(1:q)})\right] \,\Big|\, y_1(z^{(1:q)})\right]\right] \\
&= \min_{x \in A} e_1^T \tilde{\mu}^{(n)}(x) - \mathbb{E}_n\left[\min_{x \in A} \mathbb{E}_n\left[y_1(x) \mid y_1(z^{(1:q)})\right]\right] \\
&= \text{KG}(z^{(1:q)}, A),
\end{aligned}
\tag{C.1}
$$

where, recall, $y_1(x)$ is the observed function value at $x$, and $y_{2:(d+1)}(x)$ are the $d$ derivative observations at $x$. The inequality holds by Jensen's inequality, since the expectation of a minimum is at most the minimum of the expectations. Next we analyze the Bayesian optimization problem under the dynamic programming (DP) framework and show that d-KG is one-step Bayes-optimal.

Proof of Theorem 1. Suppose that we are given a budget of $N$ iterations; our goal is to choose sampling decisions $\{z^{(i)}, 1 \leq i \leq Nq\}$ and an implementation decision $z^{(Nq+1)}$ that minimizes $f(z^{(Nq+1)})$. We assume

that $(f(x), \nabla f(x))$ is drawn from the prior $GP(\tilde{\mu}, \tilde{K})$; then $(f(x), \nabla f(x))$ also follows the posterior process $GP(\tilde{\mu}^{(Nq)}, \tilde{K}^{(Nq)})$ after $N$ iterations, so we have $\mathbb{E}_{Nq}\left[f(z^{(Nq+1)})\right] = e_1^T \tilde{\mu}^{(Nq)}(z^{(Nq+1)})$. Thus, letting $\Pi$ be the set of feasible policies $\pi$, we can formulate our problem as

$$
\inf_{\pi \in \Pi} \mathbb{E}^{\pi}\left[\min_{x \in A} e_1^T \tilde{\mu}^{(Nq)}(x)\right].
$$

We analyze this problem under the DP framework. We define our state space as $S^n := (\tilde{\mu}^{(nq)}, \tilde{K}^{(nq)})$ after iteration $n$, as it completely characterizes our belief on $f$. Under the DP framework, we define the value function $V^n$ as

$$
V^n(s) := \inf_{\pi \in \Pi} \mathbb{E}^{\pi}\left[\min_{x \in A} e_1^T \tilde{\mu}^{(Nq)}(x) \,\Big|\, S^n = s\right]
\tag{C.2}
$$

for every $s = (\tilde{\mu}, \tilde{K})$. The Bellman equation tells us that the value function can be written recursively as

$$
V^n(s) = \min_{z \in A^q} Q^n(s, z), \quad \text{where} \quad Q^n(s, z) = \mathbb{E}\left[V^{n+1}(S^{n+1}) \,\Big|\, S^n = s,\ z^{((nq+1):(n+1)q)} = z\right].
$$

At the same time, we also know that any policy $\pi$ whose decisions satisfy

$$
Z^{\pi,n}(s) \in \operatorname*{argmin}_{z \in A^q} Q^n(s, z)
\tag{C.3}
$$

is optimal. If we were to stop at iteration $n+1$, then $V^{n+1}(S^{n+1}) = \min_{x \in A} e_1^T \tilde{\mu}^{((n+1)q)}(x)$ and (C.3) reduces to

$$
\begin{aligned}
Z^{\pi,n}(s) &\in \operatorname*{argmin}_{z \in A^q} \mathbb{E}\left[\min_{x \in A} e_1^T \tilde{\mu}^{((n+1)q)}(x) \,\Big|\, S^n = s,\ z^{((nq+1):(n+1)q)} = z\right] \\
&= \operatorname*{argmax}_{z \in A^q} \left\{\min_{x \in A} e_1^T \tilde{\mu}^{(nq)}(x) - \mathbb{E}\left[\min_{x \in A} e_1^T \tilde{\mu}^{((n+1)q)}(x) \,\Big|\, S^n = s,\ z^{((nq+1):(n+1)q)} = z\right]\right\},
\end{aligned}
$$

which is exactly the d-KG algorithm. This proves that d-KG is one-step Bayes-optimal.

D The Computation of d-KG and its Gradient

In this section we show how the d-KG$(z^{(1:q)}, A_n)$ factor can be computed efficiently, using the discretization in Section 3.4 of the main document. The corresponding quantity with arguments $(z^{(1:q)}, i, A_n)$ and its gradient can be computed analogously. Recall that $\tilde{K}^{(n)}$ and $\tilde{\mu}^{(n)}$ are the kernel and mean function, respectively, of the posterior after evaluating $n$ points. It is well known (e.g., see Frazier et al. [2009], Wu and Frazier [2016]) that, conditioned on $z^{(1:q)}$ and the knowledge after $n$ evaluations, $y(z^{(1:q)}) - \tilde{\mu}^{(n)}(z^{(1:q)})$ is normally distributed with zero mean and covariance matrix $\tilde{K}^{(n)}(z^{(1:q)}, z^{(1:q)}) + \mathrm{diag}\{\sigma^2(z^{(1)}), \ldots, \sigma^2(z^{(q)})\}$.
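To make the role of this predictive distribution concrete, the sketch below estimates a knowledge-gradient factor by Monte Carlo in the simplest possible setting. It is an illustration under assumptions that differ from the paper's general case: a single new point (q = 1), function values only (no derivative observations), a small finite candidate set, and made-up posterior quantities. In that setting the updated posterior mean at x is the current mean plus K(x, z) / sqrt(K(z, z) + sigma^2) times a standard normal draw. The function name and its arguments are hypothetical, and antithetic pairs (Z, -Z) are a variance-reduction choice of this sketch, not something taken from the paper.

```python
import math
import random


def kg_factor_mc(mu, cov_with_z, var_z, noise_var, num_samples=10000, seed=0):
    """Monte Carlo estimate of a one-point knowledge-gradient factor (minimization).

    mu[i]         : posterior mean at the i-th point of the finite set
    cov_with_z[i] : posterior covariance between that point and the candidate z
    var_z         : posterior variance at z
    noise_var     : observation-noise variance at z
    """
    rng = random.Random(seed)
    scale = math.sqrt(var_z + noise_var)
    # sigma_tilde[i] is the change in the posterior mean per unit of Z.
    sigma_tilde = [c / scale for c in cov_with_z]
    best_now = min(mu)

    num_pairs = max(1, num_samples // 2)
    total = 0.0
    for _ in range(num_pairs):
        draw = rng.gauss(0.0, 1.0)
        # Antithetic pair: evaluate the minimum for +Z and -Z.
        for sign in (1.0, -1.0):
            total += min(m + s * sign * draw for m, s in zip(mu, sigma_tilde))
    expected_best_next = total / (2 * num_pairs)

    # Expected one-step improvement in the best posterior mean.
    return best_now - expected_best_next


if __name__ == "__main__":
    mu = [0.0, 0.2, -0.1]
    cov_with_z = [0.5, 0.1, -0.2]
    print(kg_factor_mc(mu, cov_with_z, var_z=1.0, noise_var=0.1))
```

Because each antithetic pair averages to at most the current best mean, the estimate is non-negative up to rounding, which mirrors the Jensen argument used in the proof of Proposition 1.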
Recall that $y(z^{(1:q)})$ contains the function value and the $d$ partial derivatives for each of the $q$ points in the batch. Following Wu and Frazier [2016], we express $\tilde{\mu}^{(n+q)}(x)$ as

$$
\tilde{\mu}^{(n+q)}(x) = \tilde{\mu}^{(n)}(x) + \tilde{K}^{(n)}(x, z^{(1:q)})\left(\tilde{K}^{(n)}(z^{(1:q)}, z^{(1:q)}) + \mathrm{diag}\{\sigma^2(z^{(1)}), \ldots, \sigma^2(z^{(q)})\}\right)^{-1}\left(y(z^{(1:q)}) - \tilde{\mu}^{(n)}(z^{(1:q)})\right).
$$

Thus, we can rewrite $\tilde{\mu}^{(n+q)}(x)$ as

$$
\tilde{\mu}^{(n+q)}(x) = \tilde{\mu}^{(n)}(x) + \tilde{\sigma}^{(n)}(x, z^{(1:q)})\, Z_{q(d+1)},
\tag{D.1}
$$

where $Z_{q(d+1)}$ is a $q(d+1)$-dimensional standard normal vector and

$$
\tilde{\sigma}^{(n)}(x, z^{(1:q)}) = \tilde{K}^{(n)}(x, z^{(1:q)})\left(D^{(n)}(z^{(1:q)})^T\right)^{-1}.
$$

Here $D^{(n)}(z^{(1:q)})$ is the Cholesky factor of the covariance matrix $\tilde{K}^{(n)}(z^{(1:q)}, z^{(1:q)}) + \mathrm{diag}\{\sigma^2(z^{(1)}), \ldots, \sigma^2(z^{(q)})\}$. Now we can compute the d-KG factor using Monte Carlo sampling. To compute the gradient of the d-KG factor, we apply infinitesimal perturbation analysis (IPA), which allows us to exchange the expectation operator and the gradient operator (see Wang et al. for further details). Specifically, by Eq. (D.1), we can rewrite the expression of the approximate d-KG factor as

$$
\begin{aligned}
\text{d-KG}(z^{(1:q)}, A_n) &= \min_{x \in A_n} e_1^T \tilde{\mu}^{(n)}(x) - \mathbb{E}_n\left[\min_{x \in A_n} e_1^T \tilde{\mu}^{(n+q)}(x) \,\Big|\, y(z^{(1:q)})\right] \\
&= \mathbb{E}_{Z_{q(d+1)}}\left[\min_{x \in A_n} e_1^T \tilde{\mu}^{(n)}(x) - \min_{x \in A_n} e_1^T\left(\tilde{\mu}^{(n)}(x) + \tilde{\sigma}^{(n)}(x, z^{(1:q)})\, Z_{q(d+1)}\right)\right].
\end{aligned}
$$

Now let $x^{1,*} = \operatorname*{argmin}_{x \in A_n} e_1^T \tilde{\mu}^{(n)}(x)$ and $x^{2,*} = \operatorname*{argmin}_{x \in A_n} e_1^T\left(\tilde{\mu}^{(n)}(x) + \tilde{\sigma}^{(n)}(x, z^{(1:q)})\, Z_{q(d+1)}\right)$;

then the partial derivative of d-KG$(z^{(1:q)}, A_n)$ with respect to $z_{ij}$ is

$$
\frac{\partial}{\partial z_{ij}} \text{d-KG}(z^{(1:q)}, A_n) = \mathbb{E}_{Z_{q(d+1)}}\left[\frac{\partial}{\partial z_{ij}} e_1^T \tilde{\mu}^{(n)}(x^{1,*}) - \frac{\partial}{\partial z_{ij}} e_1^T\left(\tilde{\mu}^{(n)}(x^{2,*}) + \tilde{\sigma}^{(n)}(x^{2,*}, z^{(1:q)})\, Z_{q(d+1)}\right)\right],
$$

where $z_{ij}$ is the $j$-th dimension of the $i$-th point in $z^{(1:q)}$. Therefore, we can use multi-start gradient descent to select the next batch.

E Proof of Theorem 2

At the beginning of this section, we state two results concerning the benefits of additional samples, which will be useful in the later proofs. Recall that we defined the value function in Eq. (C.2). Similarly, we can define the value function for a specific policy $\pi$ as

$$
V^{\pi,n}(s) := \mathbb{E}^{\pi}\left[\min_{x \in A} e_1^T \tilde{\mu}^{(Nq)}(x) \,\Big|\, S^n = s\right].
\tag{E.1}
$$

Since we are varying the number of iterations $N$, we define $V^0(s; N)$ as the optimal value function when the iteration budget is $N$. Additionally, we define $V(s; \infty) := \lim_{N \to \infty} V^0(s; N)$. Similarly, we define $V^{\pi,0}(s; N)$ and $V^{\pi}(s; \infty)$ for a specific policy $\pi$. Policy $\pi$ is asymptotically consistent if $V^{\pi}(s; \infty) = V(s; \infty)$. We have the following result for any stationary policy $\pi$.

Lemma 1. For any stationary policy $\pi$ and state $s$, $V^{\pi,n}(s) \leq V^{\pi,n+1}(s)$.

This lemma states that for any stationary policy, one additional iteration helps on average.

Proof of Lemma 1. We prove the claim by induction on $n$. When $n = N - 1$, by Jensen's inequality,

$$
V^{\pi,N-1}(s) = \mathbb{E}_s\left[\min_x \tilde{\mu}^{(Nq)}(x) \,\Big|\, y(z^{(((N-1)q+1):(Nq))})\right] \leq \min_x \mathbb{E}_s\left[\tilde{\mu}^{(Nq)}(x) \,\Big|\, y(z^{(((N-1)q+1):(Nq))})\right] = V^{\pi,N}(s).
$$

Then, by the induction hypothesis,

$$
V^{\pi,n}(s) = \mathbb{E}_s\left[V^{\pi,n+1}(S^{n+1}) \,\Big|\, y(z^{((nq+1):((n+1)q))})\right] \leq \mathbb{E}_s\left[V^{\pi,n+2}(S^{n+1}) \,\Big|\, y(z^{((nq+1):((n+1)q))})\right] = V^{\pi,n+1}(s).
$$

This concludes the proof.

The following lemma is related to the optimal policy. It says that if allowed an extra fixed batch of samples, the optimal policy performs better on average than if no extra samples are allowed.

Lemma 2. For any state $s$ and $z \in A^q$, $Q^n(s, z) \leq V^{n+1}(s)$. As a direct corollary, we have $V^n(s) \leq V^{n+1}(s)$ for any state $s$.

Proof of Lemma 2.
The proof of Lemma 2 is quite similar to that of Lemma 1, so we omit the details here. The lemma below shows that $V(s; \infty)$ is well defined and bounded below.

Lemma 3. For any state $s$, $V(s; \infty)$ exists and

$$
V(s; \infty) \geq U(s) := \mathbb{E}\left[\min_{x \in A} f(x) \,\Big|\, S^0 = s\right].
\tag{E.2}
$$

Proof of Lemma 3. We will show that $V^0(S^0; N)$ is non-increasing in $N$ and bounded below by $U(S^0)$. This will imply that $V(S^0; \infty)$ exists and is bounded below by $U(S^0)$. To prove that $V^0(S^0; N)$ is non-increasing in $N$, we note that, by the corollary of Lemma 2,

$$
V^0(S^0; N) - V^0(S^0; N-1) = V^0(S^0; N) - V^1(S^0; N) \leq 0.
$$

To show that $V^0(S^0; N)$ is bounded below by $U(S^0)$, note that for every $N \geq 1$ and policy $\pi$,

$$
\mathbb{E}^{\pi}\left[\min_x e_1^T \tilde{\mu}^{(Nq)}(x)\right] = \mathbb{E}^{\pi}\left[\min_x \mathbb{E}^{\pi}_N\left[f(x)\right]\right] \geq \mathbb{E}^{\pi}\left[\mathbb{E}^{\pi}_N\left[\min_x f(x)\right]\right] = \mathbb{E}^{\pi}\left[\min_x f(x)\right] = \mathbb{E}\left[\min_x f(x)\right] = U(S^0).
$$

Thus we have $V^0(S^0; N) \geq U(S^0)$. Taking the limit $N \to \infty$, we have $V(S^0; \infty) \geq U(S^0)$.

We will now show that $V^{\pi}(S^0; \infty)$ exists for each stationary policy. The proof is similar to the one above: $V^{\pi,0}(S^0; N)$ is non-increasing in $N$ and bounded below by $U(S^0)$. Hence, $V^{\pi}(S^0; \infty)$ exists.

A policy is called stationary if its decision depends only on the current state $S^n := (\tilde{\mu}^{(n)}, \tilde{K}^{(n)})$ and not on the iteration index $n$. d-KG is stationary. The following lemma is the key idea to prove the asymptotic consistency.

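Lemma 4. If a stationary policy $\pi$ measures every alternative $x \in A$ infinitely often almost surely, then $\pi$ is asymptotically consistent and has value $U(s)$.

Proof of Lemma 4. We assume that the measurement noise has finite variance. If we sample every alternative infinitely often, this implies that the posterior sequence $\tilde{\mu}^{(Nq)}$ converges to the true surface $f$ by the vector version of the strong law of large numbers. Thus, $\lim_{N \to \infty} \tilde{\mu}^{(Nq)} = f$ a.s., and $\lim_{N \to \infty} \min_x e_1^T \tilde{\mu}^{(Nq)}(x) = \min_x f(x)$ in probability. Next we show that $\min_x e_1^T \tilde{\mu}^{(Nq)}(x)$ is uniformly integrable in $N$, which implies that it converges in $L^1$. For a fixed $K \geq 0$, we have

$$
\begin{aligned}
\mathbb{E}\left[\left|\min_x e_1^T \tilde{\mu}^{(Nq)}(x)\right| \mathbf{1}_{\{|\min_x e_1^T \tilde{\mu}^{(Nq)}(x)| \geq K\}}\right]
&\leq \mathbb{E}\left[\max_x \left|e_1^T \tilde{\mu}^{(Nq)}(x)\right| \mathbf{1}_{\{\max_x |e_1^T \tilde{\mu}^{(Nq)}(x)| \geq K\}}\right] \\
&= \mathbb{E}\left[\max_x \left|\mathbb{E}_{Nq}(f(x))\right| \mathbf{1}_{\{\max_x |\mathbb{E}_{Nq}(f(x))| \geq K\}}\right] \\
&\leq \mathbb{E}\left[\max_x \mathbb{E}_{Nq}(|f(x)|)\, \mathbf{1}_{\{\max_x \mathbb{E}_{Nq}(|f(x)|) \geq K\}}\right] \\
&\leq \mathbb{E}\left[\mathbb{E}_{Nq}\left(\max_x |f(x)|\right) \mathbf{1}_{\{\mathbb{E}_{Nq}(\max_x |f(x)|) \geq K\}}\right] \\
&= \mathbb{E}\left[\mathbb{E}_{Nq}\left(\max_x |f(x)|\, \mathbf{1}_{\{\mathbb{E}_{Nq}(\max_x |f(x)|) \geq K\}}\right)\right] \\
&= \mathbb{E}\left[\max_x |f(x)|\, \mathbf{1}_{\{\mathbb{E}_{Nq}(\max_x |f(x)|) \geq K\}}\right].
\end{aligned}
$$

Since $\max_x |f(x)|$ is integrable and $P\left(\mathbb{E}_{Nq}(\max_x |f(x)|) \geq K\right) \leq \mathbb{E}\left(\max_x |f(x)|\right)/K$ is bounded uniformly in $N$ and goes to zero as $K$ increases to infinity, the sequence is uniformly integrable. Given that $\min_x e_1^T \tilde{\mu}^{(Nq)}(x)$ converges in $L^1$, we have

$$
V^{\pi}(S^0; \infty) = \lim_{N \to \infty} \mathbb{E}^{\pi}\left[\min_x e_1^T \tilde{\mu}^{(Nq)}(x)\right] = \mathbb{E}^{\pi}\left[\lim_{N \to \infty} \min_x e_1^T \tilde{\mu}^{(Nq)}(x)\right] = \mathbb{E}^{\pi}\left[\min_x f(x)\right] = U(S^0).
$$

Since the optimal value satisfies $V(S^0; \infty) \leq V^{\pi}(S^0; \infty)$, Lemma 3 lets us conclude that $V^{\pi}(S^0; \infty) = V(S^0; \infty) = U(S^0)$.

We now show that d-KG measures every alternative $x \in A$ infinitely often when $N$ goes to infinity, which leads to the proof of Theorem 2.

Proof of Theorem 2. Since d-KG is a stationary policy, we only need to show that the d-KG algorithm samples every alternative infinitely often as $N$ goes to infinity. By a proof similar to that of Lemma A.5 in Frazier et al. [2009], we can show that $S^n$ converges to a random variable $S^{\infty} := (\tilde{\mu}^{\infty}, \tilde{K}^{\infty})$ as $n$ increases. By definition,

$$
V^N(S^{\infty}) - Q^{N-1}(S^{\infty}; x) = \min_{x'} e_1^T \tilde{\mu}^{\infty}(x') - \mathbb{E}\left[\min_{x'} \left(e_1^T \tilde{\mu}^{\infty}(x') + e_1^T \tilde{\sigma}^{\infty}(x', z^{(1:q)})\, Z_{q(d+1)}\right)\right].
$$

If we have measured $x$ infinitely often, there is no uncertainty around $f(x)$ in $S^{\infty}$, so $V^N(S^{\infty}) = Q^{N-1}(S^{\infty}; x)$. If we have not measured $x$ infinitely often, then $V^N(S^{\infty}) > Q^{N-1}(S^{\infty}; x)$, i.e., there is a benefit to measuring $x$. We define $E = \{x \in A : \text{the number of times } x \text{ is measured} < \infty\}$; then for any $x \in E$ and $y \in E^c$, we have $Q^{N-1}(S^{\infty}; x) < V^N(S^{\infty}) = Q^{N-1}(S^{\infty}; y)$. By the definition of d-KG, it will measure some $x \in E$, i.e., at least one $x$ in $E$ is measured infinitely often, a contradiction.

F Detailed Results on Synthetic Test Functions

In this section, we plot the results on six synthetic functions in Fig. 5. Recall that we plot the immediate regret of the solution that each algorithm would pick as a function of the number of function evaluations.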

[Figure 5, six panels. Vertical axis: the log10 scale of the immediate regret. Panel titles: 2d Branin function with batch size 4, noisy full gradient available; Rosenbrock function with batch size 4, noisy 3rd derivative available; Levy function with batch size 8, noisy 4th derivative available; 5d Ackley function with batch size 4, noisy full gradient available; Hartmann function with batch size 8, noisy full gradient available; 8d Cosine function with batch size 8, noisy 1st and 2nd derivatives available.]

Figure 5: The average performance of 100 replications (the log10 of the immediate regret vs. the number of function evaluations). For the Branin, Ackley, and Hartmann functions, we assume that a noisy observation of the full gradient is available. On the other functions only one or two partial derivatives can be observed (with noise). d-KG performs significantly better than its competitors for all benchmarks except the Levy function.
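As a small illustration of the quantity plotted in Figure 5, the sketch below computes a regret curve from per-replication results. It rests on assumptions not stated in the text: the problem is a minimization with a known optimal value, regret is averaged over replications before taking log10 (the text does not say whether the average is taken before or after the logarithm), and the function name and input layout are hypothetical.

```python
import math


def log10_mean_regret(picked_values, optimum):
    """Return log10 of the replication-averaged immediate regret per step.

    picked_values[r][t] : objective value of the solution that replication r
                          would pick after step t (minimization)
    optimum             : known global minimum of the test function
    """
    num_reps = len(picked_values)
    horizon = len(picked_values[0])
    curve = []
    for t in range(horizon):
        # Immediate regret is the gap between the picked value and the optimum.
        regrets = [picked_values[r][t] - optimum for r in range(num_reps)]
        mean_regret = sum(regrets) / num_reps
        # log10 is undefined at zero regret; this sketch assumes a positive mean.
        curve.append(math.log10(mean_regret))
    return curve


if __name__ == "__main__":
    # Two made-up replications over two steps, with optimum 1.0.
    print(log10_mean_regret([[3.0, 2.0], [5.0, 2.0]], optimum=1.0))
```

With noisy evaluations the regret should be computed from the true objective value at the picked point, not from the noisy observation, which is why the sketch takes the picked values as given.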

References

NYC Trip Record Data. tlc/, Last accessed on
M. O. Ahmed, B. Shahriari, and M. Schmidt. Do we need harmless Bayesian optimization and first-order Bayesian optimization?
D. Bingham. Optimization test problems. http://
E. Brochu, V. M. Cora, and N. De Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint.
E. Contal, D. Buffoni, A. Robicquet, and N. Vayatis. Parallel Gaussian process optimization with upper confidence bound and pure exploration. In Machine Learning and Knowledge Discovery in Databases. Springer.
T. Desautels, A. Krause, and J. W. Burdick. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. The Journal of Machine Learning Research, 15(1).
A. Forrester, A. Sobester, and A. Keane. Engineering design via surrogate modelling: a practical guide. John Wiley & Sons.
P. Frazier, W. Powell, and S. Dayanik. The knowledge-gradient policy for correlated normal beliefs. INFORMS Journal on Computing, 21(4).
J. Fu, H. Luo, J. Feng, and T.-S. Chua. Distilling reverse-mode automatic differentiation (DrMAD) for optimizing hyperparameters of deep neural networks. arXiv preprint.
J. R. Gardner, M. J. Kusner, Z. E. Xu, K. Q. Weinberger, and J. Cunningham. Bayesian optimization with inequality constraints. In ICML.
M. Gelbart, J. Snoek, and R. Adams. Bayesian optimization with unknown constraints. In ICML, Corvallis, Oregon. AUAI Press.
J. Gonzalez, Z. Dai, P. Hennig, and N. Lawrence. Batch Bayesian optimization via local penalization. In AISTATS.
J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems.
D. Huang, T. T. Allen, W. I. Notz, and N. Zeng. Global optimization of stochastic black-box systems via sequential kriging meta-models. Journal of Global Optimization, 34(3).
A. Jameson. Re-engineering the design process through computation. Journal of Aircraft, 36(1):36-50.
D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4).
T. Kathuria, A. Deshpande, and P. Kohli. Batched Gaussian process bandit optimization via determinantal point processes. In Advances in Neural Information Processing Systems.
J. P. Kleijnen. Simulation-optimization via kriging and bootstrapping: a survey. Journal of Simulation, 8(4).
M. Lázaro-Gredilla, J. Quiñonero-Candela, C. E. Rasmussen, and A. R. Figueiras-Vidal. Sparse spectrum Gaussian process regression. The Journal of Machine Learning Research, 11.
Y. LeCun, C. Cortes, and C. J. Burges. The MNIST database of handwritten digits.
D. J. Lizotte. Practical Bayesian optimization. PhD thesis.
J. Luketina, M. Berglund, and T. Raiko. Scalable gradient-based tuning of continuous regularization hyperparameters. arXiv preprint.
D. Maclaurin, D. Duvenaud, and R. P. Adams. Gradient-based hyperparameter optimization through reversible learning. In ICML.
S. Marmin, C. Chevalier, and D. Ginsbourger. Efficient batch-sequential Bayesian optimization with moments of truncated Gaussian vectors. arXiv preprint.
R. M. Neal. Slice sampling. Annals of Statistics.
M. A. Osborne, R. Garnett, and S. J. Roberts. Gaussian processes for global optimization. In 3rd International Conference on Learning and Intelligent Optimization (LION3). Citeseer.
V. Picheny, D. Ginsbourger, Y. Richet, and G. Caplin. Quantile-based optimization of noisy computer experiments with tunable precision. Technometrics, 55(1):2-13.


More information

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Lecture 4: Types of errors. Bayesian regression models. Logistic regression Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture

More information

Context-Dependent Bayesian Optimization in Real-Time Optimal Control: A Case Study in Airborne Wind Energy Systems

Context-Dependent Bayesian Optimization in Real-Time Optimal Control: A Case Study in Airborne Wind Energy Systems Context-Dependent Bayesian Optimization in Real-Time Optimal Control: A Case Study in Airborne Wind Energy Systems Ali Baheri Department of Mechanical Engineering University of North Carolina at Charlotte

More information

Gaussian Process Optimization with Mutual Information

Gaussian Process Optimization with Mutual Information Gaussian Process Optimization with Mutual Information Emile Contal 1 Vianney Perchet 2 Nicolas Vayatis 1 1 CMLA Ecole Normale Suprieure de Cachan & CNRS, France 2 LPMA Université Paris Diderot & CNRS,

More information

A parametric approach to Bayesian optimization with pairwise comparisons

A parametric approach to Bayesian optimization with pairwise comparisons A parametric approach to Bayesian optimization with pairwise comparisons Marco Co Eindhoven University of Technology m.g.h.co@tue.nl Bert de Vries Eindhoven University of Technology and GN Hearing bdevries@ieee.org

More information

High Dimensional Bayesian Optimization via Restricted Projection Pursuit Models

High Dimensional Bayesian Optimization via Restricted Projection Pursuit Models High Dimensional Bayesian Optimization via Restricted Projection Pursuit Models Chun-Liang Li Kirthevasan Kandasamy Barnabás Póczos Jeff Schneider {chunlial, kandasamy, bapoczos, schneide}@cs.cmu.edu Carnegie

More information

Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 3 Stochastic Gradients, Bayesian Inference, and Occam s Razor https://people.orie.cornell.edu/andrew/orie6741 Cornell University August

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

Batch Bayesian Optimization via Simulation Matching

Batch Bayesian Optimization via Simulation Matching Batch Bayesian Optimization via Simulation Matching Javad Azimi, Alan Fern, Xiaoli Z. Fern School of EECS, Oregon State University {azimi, afern, xfern}@eecs.oregonstate.edu Abstract Bayesian optimization

More information

Nonparametric Bayesian inference on multivariate exponential families

Nonparametric Bayesian inference on multivariate exponential families Nonparametric Bayesian inference on multivariate exponential families William Vega-Brown, Marek Doniec, and Nicholas Roy Massachusetts Institute of Technology Cambridge, MA 2139 {wrvb, doniec, nickroy}@csail.mit.edu

More information

Model Selection for Gaussian Processes

Model Selection for Gaussian Processes Institute for Adaptive and Neural Computation School of Informatics,, UK December 26 Outline GP basics Model selection: covariance functions and parameterizations Criteria for model selection Marginal

More information

Reading Group on Deep Learning Session 1

Reading Group on Deep Learning Session 1 Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized

More information

Afternoon Meeting on Bayesian Computation 2018 University of Reading

Afternoon Meeting on Bayesian Computation 2018 University of Reading Gabriele Abbati 1, Alessra Tosi 2, Seth Flaxman 3, Michael A Osborne 1 1 University of Oxford, 2 Mind Foundry Ltd, 3 Imperial College London Afternoon Meeting on Bayesian Computation 2018 University of

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Local Expectation Gradients for Doubly Stochastic. Variational Inference

Local Expectation Gradients for Doubly Stochastic. Variational Inference Local Expectation Gradients for Doubly Stochastic Variational Inference arxiv:1503.01494v1 [stat.ml] 4 Mar 2015 Michalis K. Titsias Athens University of Economics and Business, 76, Patission Str. GR10434,

More information

Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm

Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm Qiang Liu and Dilin Wang NIPS 2016 Discussion by Yunchen Pu March 17, 2017 March 17, 2017 1 / 8 Introduction Let x R d

More information

Gaussian Process Regression

Gaussian Process Regression Gaussian Process Regression 4F1 Pattern Recognition, 21 Carl Edward Rasmussen Department of Engineering, University of Cambridge November 11th - 16th, 21 Rasmussen (Engineering, Cambridge) Gaussian Process

More information

Linear classifiers: Overfitting and regularization

Linear classifiers: Overfitting and regularization Linear classifiers: Overfitting and regularization Emily Fox University of Washington January 25, 2017 Logistic regression recap 1 . Thus far, we focused on decision boundaries Score(x i ) = w 0 h 0 (x

More information

Midterm exam CS 189/289, Fall 2015

Midterm exam CS 189/289, Fall 2015 Midterm exam CS 189/289, Fall 2015 You have 80 minutes for the exam. Total 100 points: 1. True/False: 36 points (18 questions, 2 points each). 2. Multiple-choice questions: 24 points (8 questions, 3 points

More information

Convergence Rate of Expectation-Maximization

Convergence Rate of Expectation-Maximization Convergence Rate of Expectation-Maximiation Raunak Kumar University of British Columbia Mark Schmidt University of British Columbia Abstract raunakkumar17@outlookcom schmidtm@csubcca Expectation-maximiation

More information

INTRODUCTION TO PATTERN RECOGNITION

INTRODUCTION TO PATTERN RECOGNITION INTRODUCTION TO PATTERN RECOGNITION INSTRUCTOR: WEI DING 1 Pattern Recognition Automatic discovery of regularities in data through the use of computer algorithms With the use of these regularities to take

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

How to build an automatic statistician

How to build an automatic statistician How to build an automatic statistician James Robert Lloyd 1, David Duvenaud 1, Roger Grosse 2, Joshua Tenenbaum 2, Zoubin Ghahramani 1 1: Department of Engineering, University of Cambridge, UK 2: Massachusetts

More information

GWAS V: Gaussian processes

GWAS V: Gaussian processes GWAS V: Gaussian processes Dr. Oliver Stegle Christoh Lippert Prof. Dr. Karsten Borgwardt Max-Planck-Institutes Tübingen, Germany Tübingen Summer 2011 Oliver Stegle GWAS V: Gaussian processes Summer 2011

More information

Introduction to Gaussian Process

Introduction to Gaussian Process Introduction to Gaussian Process CS 778 Chris Tensmeyer CS 478 INTRODUCTION 1 What Topic? Machine Learning Regression Bayesian ML Bayesian Regression Bayesian Non-parametric Gaussian Process (GP) GP Regression

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Confidence Estimation Methods for Neural Networks: A Practical Comparison

Confidence Estimation Methods for Neural Networks: A Practical Comparison , 6-8 000, Confidence Estimation Methods for : A Practical Comparison G. Papadopoulos, P.J. Edwards, A.F. Murray Department of Electronics and Electrical Engineering, University of Edinburgh Abstract.

More information

Gaussian Process Regression: Active Data Selection and Test Point. Rejection. Sambu Seo Marko Wallat Thore Graepel Klaus Obermayer

Gaussian Process Regression: Active Data Selection and Test Point. Rejection. Sambu Seo Marko Wallat Thore Graepel Klaus Obermayer Gaussian Process Regression: Active Data Selection and Test Point Rejection Sambu Seo Marko Wallat Thore Graepel Klaus Obermayer Department of Computer Science, Technical University of Berlin Franklinstr.8,

More information

HOMEWORK #4: LOGISTIC REGRESSION

HOMEWORK #4: LOGISTIC REGRESSION HOMEWORK #4: LOGISTIC REGRESSION Probabilistic Learning: Theory and Algorithms CS 274A, Winter 2019 Due: 11am Monday, February 25th, 2019 Submit scan of plots/written responses to Gradebook; submit your

More information

Linear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Linear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington Linear Classification CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Example of Linear Classification Red points: patterns belonging

More information

GAUSSIAN PROCESS REGRESSION

GAUSSIAN PROCESS REGRESSION GAUSSIAN PROCESS REGRESSION CSE 515T Spring 2015 1. BACKGROUND The kernel trick again... The Kernel Trick Consider again the linear regression model: y(x) = φ(x) w + ε, with prior p(w) = N (w; 0, Σ). The

More information

Gaussian Mixture Models

Gaussian Mixture Models Gaussian Mixture Models Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 Some slides courtesy of Eric Xing, Carlos Guestrin (One) bad case for K- means Clusters may overlap Some

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

arxiv: v1 [stat.ml] 10 Dec 2017

arxiv: v1 [stat.ml] 10 Dec 2017 Sensitivity Analysis for Predictive Uncertainty in Bayesian Neural Networks Stefan Depeweg 1,2, José Miguel Hernández-Lobato 3, Steffen Udluft 2, Thomas Runkler 1,2 arxiv:1712.03605v1 [stat.ml] 10 Dec

More information

Prediction of double gene knockout measurements

Prediction of double gene knockout measurements Prediction of double gene knockout measurements Sofia Kyriazopoulou-Panagiotopoulou sofiakp@stanford.edu December 12, 2008 Abstract One way to get an insight into the potential interaction between a pair

More information

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017 Simple Techniques for Improving SGD CS6787 Lecture 2 Fall 2017 Step Sizes and Convergence Where we left off Stochastic gradient descent x t+1 = x t rf(x t ; yĩt ) Much faster per iteration than gradient

More information

The Knowledge Gradient for Sequential Decision Making with Stochastic Binary Feedbacks

The Knowledge Gradient for Sequential Decision Making with Stochastic Binary Feedbacks The Knowledge Gradient for Sequential Decision Making with Stochastic Binary Feedbacks Yingfei Wang, Chu Wang and Warren B. Powell Princeton University Yingfei Wang Optimal Learning Methods June 22, 2016

More information

Doubly Stochastic Inference for Deep Gaussian Processes. Hugh Salimbeni Department of Computing Imperial College London

Doubly Stochastic Inference for Deep Gaussian Processes. Hugh Salimbeni Department of Computing Imperial College London Doubly Stochastic Inference for Deep Gaussian Processes Hugh Salimbeni Department of Computing Imperial College London 29/5/2017 Motivation DGPs promise much, but are difficult to train Doubly Stochastic

More information

Recurrent Latent Variable Networks for Session-Based Recommendation

Recurrent Latent Variable Networks for Session-Based Recommendation Recurrent Latent Variable Networks for Session-Based Recommendation Panayiotis Christodoulou Cyprus University of Technology paa.christodoulou@edu.cut.ac.cy 27/8/2017 Panayiotis Christodoulou (C.U.T.)

More information

Computer Vision Group Prof. Daniel Cremers. 9. Gaussian Processes - Regression

Computer Vision Group Prof. Daniel Cremers. 9. Gaussian Processes - Regression Group Prof. Daniel Cremers 9. Gaussian Processes - Regression Repetition: Regularized Regression Before, we solved for w using the pseudoinverse. But: we can kernelize this problem as well! First step:

More information

Machine Learning Practice Page 2 of 2 10/28/13

Machine Learning Practice Page 2 of 2 10/28/13 Machine Learning 10-701 Practice Page 2 of 2 10/28/13 1. True or False Please give an explanation for your answer, this is worth 1 pt/question. (a) (2 points) No classifier can do better than a naive Bayes

More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

Midterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so.

Midterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so. CS 89 Spring 07 Introduction to Machine Learning Midterm Please do not open the exam before you are instructed to do so. The exam is closed book, closed notes except your one-page cheat sheet. Electronic

More information

Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 4 Occam s Razor, Model Construction, and Directed Graphical Models https://people.orie.cornell.edu/andrew/orie6741 Cornell University September

More information

Optimization of Gaussian Process Hyperparameters using Rprop

Optimization of Gaussian Process Hyperparameters using Rprop Optimization of Gaussian Process Hyperparameters using Rprop Manuel Blum and Martin Riedmiller University of Freiburg - Department of Computer Science Freiburg, Germany Abstract. Gaussian processes are

More information

Worst-Case Bounds for Gaussian Process Models

Worst-Case Bounds for Gaussian Process Models Worst-Case Bounds for Gaussian Process Models Sham M. Kakade University of Pennsylvania Matthias W. Seeger UC Berkeley Abstract Dean P. Foster University of Pennsylvania We present a competitive analysis

More information