EigenGP: Gaussian Process Models with Adaptive Eigenfunctions

Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015)

EigenGP: Gaussian Process Models with Adaptive Eigenfunctions

Hao Peng, Department of Computer Science, Purdue University, West Lafayette, IN 47907, USA, pengh@purdue.edu
Yuan Qi, Departments of Computer Science and Statistics, Purdue University, West Lafayette, IN 47907, USA, alanqi@cs.purdue.edu

Abstract

Gaussian processes (GPs) provide a nonparametric representation of functions. However, classical GP inference suffers from high computational cost for big data. In this paper, we propose a new Bayesian approach, EigenGP, that learns both basis dictionary elements (eigenfunctions of a GP prior) and prior precisions in a sparse finite model. It is well known that, among all orthogonal basis functions, eigenfunctions can provide the most compact representation. Unlike other sparse Bayesian finite models where the basis function has a fixed form, our eigenfunctions live in a reproducing kernel Hilbert space as a finite linear combination of kernel functions. We learn the dictionary elements (eigenfunctions) and the prior precisions over these elements, as well as all the other hyperparameters, from data by maximizing the model marginal likelihood. We explore computational linear algebra to simplify the gradient computation significantly. Our experimental results demonstrate improved predictive performance of EigenGP over alternative sparse GP methods as well as relevance vector machines.

1 Introduction

Gaussian processes (GPs) are powerful nonparametric Bayesian models with numerous applications in machine learning and statistics. GP inference, however, is costly. Training the exact GP regression model with N samples is expensive: it takes an O(N^2) space cost and an O(N^3) time cost. To address this issue, a variety of approximate sparse GP inference approaches have been developed [Williams and Seeger, 2001; Csató and Opper, 2002; Snelson and Ghahramani, 2006; Lázaro-Gredilla et al., 2010; Williams and Barber, 1998; Titsias, 2009; Qi et al., 2010; Higdon, 2002; Cressie and Johannesson, 2008], for example, using the Nyström method to approximate covariance matrices [Williams and Seeger, 2001], optimizing a variational bound on the marginal likelihood [Titsias, 2009], or grounding the GP on a small set of (blurred) basis points [Snelson and Ghahramani, 2006; Qi et al., 2010]. An elegant unifying view of various sparse GP regression models is given by Quiñonero-Candela and Rasmussen [2005].

Among all sparse GP regression methods, a state-of-the-art approach is to represent a function as a sparse finite linear combination of pairs of trigonometric basis functions, a sine and a cosine for each spectral point; this approach is therefore called the sparse spectrum Gaussian process (SSGP) [Lázaro-Gredilla et al., 2010]. SSGP integrates out both weights and phases of the trigonometric functions and learns all hyperparameters of the model (frequencies and amplitudes) by maximizing the marginal likelihood. Using global trigonometric basis functions, SSGP has the capability of approximating any stationary Gaussian process model and has been shown to outperform alternative sparse GP methods, including the fully independent training conditional (FITC) approximation [Snelson and Ghahramani, 2006], on benchmark datasets. Another popular sparse Bayesian finite linear model is the relevance vector machine (RVM) [Tipping, 2000; Faul and Tipping, 2001]. It uses kernel expansions over training samples as basis functions and selects the basis functions by automatic relevance determination [MacKay, 1992; Faul and Tipping, 2001].
In this paper, we propose a new sparse Bayesian approach, EigenGP, that learns both functional dictionary elements (eigenfunctions) and prior precisions in a finite linear model representation of a GP. It is well known that, among all orthogonal basis functions, eigenfunctions provide the most compact representation. Unlike SSGP or RVMs, where the basis function has a fixed form, our eigenfunctions live in a reproducing kernel Hilbert space (RKHS) as a finite linear combination of kernel functions with their weights learned from data. We further marginalize out the weights over the eigenfunctions and estimate all hyperparameters, including the basis points for the eigenfunctions, the lengthscales, and the precision of the weight prior, by maximizing the model marginal likelihood (also known as evidence). To do so, we explore computational linear algebra and greatly simplify the gradient computation for optimization (thus our optimization method is totally different from RVM optimization methods). As a result of this optimization, our eigenfunctions are data dependent and make EigenGP capable of accurately modeling nonstationary data. Furthermore, by adding an additional kernel term to our model, we can turn the finite model into an infinite model to model the prediction uncertainty better; that is, it can give nonzero prediction variance when a test sample is far from the training samples.

EigenGP is computationally efficient. It takes O(NM) space and O(NM^2) time for training on N samples with M basis functions, which is the same as SSGP and more efficient than RVMs (as RVMs learn weights over N, not M, basis functions). Similar to SSGP and FITC, EigenGP focuses on predictive accuracy at low computational cost, rather than on faithfully converging towards the full GP as the number of basis functions grows. (For the latter case, please see the approach of Yan and Qi [2010], which explicitly minimizes the KL divergence between the exact and approximate GP posterior processes.)

The rest of the paper is organized as follows. Section 2 describes the background of GPs. Section 3 presents the EigenGP model and an illustrative example. Section 4 outlines the marginal likelihood maximization for learning the dictionary elements and the other hyperparameters. In Section 5, we discuss related work. Section 6 shows regression results on multiple benchmark regression datasets, demonstrating improved performance of EigenGP over the Nyström method [Williams and Seeger, 2001], RVM, SSGP, and FITC.

2 Background of Gaussian Processes

We denote N independent and identically distributed samples as D = {(x_1, y_1), ..., (x_N, y_N)}, where x_i is a D-dimensional input (i.e., explanatory variables) and y_i is a scalar output (i.e., a response), which we assume is the noisy realization of a latent function f at x_i. A Gaussian process places a prior distribution over the latent function f. Its projection f at {x_i}_{i=1}^N defines a joint Gaussian distribution p(f) = N(f | m_0, K), where, without any prior preference, the mean m_0 is set to 0 and the covariance K_{ij} = k(x_i, x_j) encodes the prior notion of smoothness. A popular choice is the anisotropic squared exponential covariance function:

$k(x, x') = a^2 \exp\left(-(x - x')^T \mathrm{diag}(\eta)\,(x - x')\right)$,

where the hyperparameters include the signal variance a^2 and the lengthscales η = {η_d}_{d=1}^D, controlling how fast the covariance decays with the distance between inputs. Using this covariance function, we can prune input dimensions by shrinking the corresponding lengthscales based on the data (when η_d = 0, the d-th dimension becomes totally irrelevant to the covariance function value). This pruning is known as automatic relevance determination (ARD), and this covariance is therefore also called the ARD squared exponential. Note that the covariance function value remains the same whenever (x - x') is the same, regardless of where x and x' are, which leads to a stationary GP model. For nonstationary data, however, a stationary GP model is a misfit. Although nonstationary GP models have been developed and applied to real-world applications, they are often limited to low-dimensional problems, such as applications in spatial statistics [Paciorek and Schervish, 2004]. Constructing general nonstationary GP models remains a challenging task.

For regression, we use a Gaussian likelihood function p(y_i | f) = N(y_i | f(x_i), σ^2), where σ^2 is the variance of the observation noise. Given the Gaussian process prior over f and the data likelihood, the exact posterior process is p(f | D) ∝ GP(f | 0, k) ∏_{i=1}^N p(y_i | f). Although the posterior process for GP regression has an analytical form, we need to store and invert an N by N matrix, which has computational complexity O(N^3), rendering GPs infeasible for big data analytics.

3 Model of EigenGP

To enable fast inference and obtain a nonstationary covariance function, our new model projects the GP prior onto an eigensubspace. Specifically, we set the latent function to

$f(x) = \sum_{j=1}^{M} \alpha_j \phi_j(x)$,   (1)

where M ≪ N and {φ_j(x)} are eigenfunctions of the GP prior.
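As a concrete reference, the ARD squared exponential covariance above translates directly into code. The following is a minimal NumPy sketch under our own naming conventions; it is an illustration, not code from the paper.

```python
import numpy as np

def ard_se_kernel(X1, X2, a2, eta):
    """ARD squared exponential kernel: k(x, x') = a2 * exp(-(x - x')^T diag(eta) (x - x')).

    X1: (n1, D) and X2: (n2, D) inputs; a2: signal variance; eta: (D,) per-dimension scales.
    Returns the (n1, n2) covariance matrix.
    """
    diff = X1[:, None, :] - X2[None, :, :]            # pairwise differences, shape (n1, n2, D)
    sq = np.einsum('ijd,d,ijd->ij', diff, eta, diff)  # weighted squared distances
    return a2 * np.exp(-sq)
```

Setting a component of eta to zero removes the corresponding input dimension from the covariance, which is exactly the ARD pruning effect described above.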
We assign a Gaussian prior over α = [α_1, ..., α_M]^T, α ~ N(0, diag(w)), so that f follows a GP prior with zero mean and the covariance function

$\tilde{k}(x, x') = \sum_{j=1}^{M} w_j \phi_j(x) \phi_j(x')$.   (2)

To compute the eigenfunctions {φ_j(x)}, we can use the Galerkin projection to approximate them by Hermite polynomials [Marzouk and Najm, 2009]. For high-dimensional problems, however, this approach requires a tensor product of univariate Hermite polynomials that dramatically increases the number of parameters. To avoid this problem, we apply the Nyström method [Williams and Seeger, 2001], which allows us to obtain an approximation to the eigenfunctions in a high-dimensional space efficiently. Specifically, given inducing inputs (i.e., basis points) B = [b_1, ..., b_M], we replace

$\int k(x, x') \phi_j(x) p(x)\, dx = \lambda_j \phi_j(x')$   (3)

by its Monte Carlo approximation

$\frac{1}{M} \sum_{i=1}^{M} k(x, b_i) \phi_j(b_i) \approx \lambda_j \phi_j(x)$.   (4)

Then, by evaluating this equation at B to estimate the values of φ_j(b_i) and λ_j, we obtain the j-th eigenfunction φ_j(x) as

$\phi_j(x) = \frac{\sqrt{M}}{\lambda_j^{(M)}}\, \mathbf{k}(x)\, \mathbf{u}_j^{(M)} = \mathbf{k}(x)\, \mathbf{u}_j$,   (5)

where k(x) ≜ [k(x, b_1), ..., k(x, b_M)], λ_j^{(M)} and u_j^{(M)} are the j-th eigenvalue and eigenvector of the covariance function evaluated at B, and u_j = √M u_j^{(M)} / λ_j^{(M)}. Note that, for simplicity, we have chosen the number of inducing inputs to be the same as the number of eigenfunctions; in practice, we can use more inducing inputs while computing only the top M eigenvectors. As shown in (5), each eigenfunction lives in an RKHS as a linear combination of the kernel functions evaluated at B with weights u_j.
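To make the construction in (3)-(5) concrete, here is a small NumPy sketch of the Nyström eigenfunction approximation. It is our own illustration and naming, assuming a kernel function such as ard_se_kernel above; it is not the authors' implementation.

```python
import numpy as np

def nystrom_eigenfunctions(B, kernel, **kernel_args):
    """Build a function phi(X) that evaluates the M Nystrom eigenfunctions of Eq. (5) at inputs X.

    B: (M, D) inducing inputs; kernel(X1, X2, ...): covariance function, e.g. ard_se_kernel.
    """
    M = B.shape[0]
    K_BB = kernel(B, B, **kernel_args)          # covariance evaluated at the basis points
    lam, U = np.linalg.eigh(K_BB)               # eigenvalues ascending; columns of U are u_j^(M)
    lam, U = lam[::-1], U[:, ::-1]              # reorder so j indexes the largest eigenvalues first
    W = np.sqrt(M) * U / lam                    # column j holds u_j = sqrt(M) u_j^(M) / lambda_j^(M)
    def phi(X):
        return kernel(X, B, **kernel_args) @ W  # (N, M) matrix with entries phi_j(x_i)
    return phi
```

In practice one would guard against very small eigenvalues of K_BB (for example by adding a small jitter to its diagonal) before dividing by them.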

Inserting (5) into (1), we obtain

$f(x) = \sum_{j=1}^{M} \alpha_j \sum_{i=1}^{M} u_{ij}\, k(x, b_i)$.   (6)

This equation reveals a two-layer structure of EigenGP. The first layer linearly combines multiple kernel functions to generate each eigenfunction φ_j. The second layer takes these eigenfunctions as the basis functions to generate the function value f. Note that f is a Bayesian linear combination of {φ_j} in which the weights α are integrated out to avoid overfitting. All the model hyperparameters are learned from data. Specifically, for the first layer, to learn the eigenfunctions {φ_j}, we estimate the inducing inputs B and the kernel hyperparameters (such as the lengthscales η for the ARD kernel) by maximizing the model marginal likelihood. For the second layer, we marginalize out α to avoid overfitting and maximize the model marginal likelihood to learn the hyperparameters w of the prior.

With the estimated hyperparameters, the prior over f is nonstationary because its covariance function (2) varies in different regions of x. This comes as no surprise, since the eigenfunctions are tied to p(x) in (3). This nonstationarity reflects the fact that our model is adaptive to the distribution of the explanatory variables. Note that, to recover the full uncertainty captured by the kernel function k, we can add the following term to the kernel function of EigenGP:

$\delta(x - x')\left(k(x, x') - \tilde{k}(x, x')\right)$,   (7)

where δ(a) = 1 if and only if a = 0. Compared to the original EigenGP model, which has a finite degree of freedom, this modified model has an infinite number of basis functions (assuming k has an infinite number of basis functions, as the ARD kernel does). Thus, this model can accurately model the uncertainty of a test point even when it is far from the training set. We derive the optimization updates of all the hyperparameters for both the original and modified models. But, according to our experiments, the modified model does not improve the prediction accuracy over the original (it even reduces the accuracy sometimes). Therefore, we focus on the original model in our presentation for its simplicity.

Before we present the details of hyperparameter optimization, let us first look at an illustrative example of its effect, in particular, the effect of optimizing the inducing inputs B.

3.1 Illustrative Example

For this example, we consider a toy dataset used by the FITC algorithm [Snelson and Ghahramani, 2006]. It contains 200 one-dimensional training samples. We use 5 basis points (M = 5) and choose the ARD kernel. We then compare the basis functions {φ_j} and the corresponding predictive distributions in two cases. For the first case, we use the kernel width η learned from the full GP model as the kernel width for EigenGP, and apply K-means to set the basis points B to the cluster centers; the idea of using K-means to set the basis points has been suggested by Zhang et al. [2008] to minimize an error bound for the Nyström approximation. For the second case, we optimize η, B and w by maximizing the marginal likelihood of EigenGP. The results are shown in Figure 1.

Figure 1: Illustration of the effect of optimizing the basis points (i.e., inducing inputs). Columns: basis points learned from K-means and basis points estimated by evidence maximization; rows: predictions and eigenfunctions. In the first row, the pink dots represent data points, the blue and red solid curves correspond to the predictive mean of EigenGP and ± two standard deviations around the mean, the black curve corresponds to the predictive mean of the full GP, and the black crosses denote the basis points. In the second row, the curves in various colors represent the five eigenfunctions of EigenGP.
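Before turning to the effect of optimizing B, it may help to see how predictions are formed in the finite model (1)-(2). The sketch below (ours, not the authors' code) computes the posterior over α for a Bayesian linear model with prior α ~ N(0, diag(w)) and Gaussian noise, and returns the predictive mean and variance at test inputs.

```python
import numpy as np

def eigengp_predict(Phi_train, Phi_test, y, w, sigma2):
    """Predictive mean/variance for f(x) = sum_j alpha_j phi_j(x) with alpha ~ N(0, diag(w)).

    Phi_train: (N, M) eigenfunctions at training inputs; Phi_test: (T, M) at test inputs;
    y: (N,) targets; w: (M,) prior variances; sigma2: observation noise variance.
    """
    # Posterior over alpha: Sigma = (diag(1/w) + Phi^T Phi / sigma2)^{-1}, mu = Sigma Phi^T y / sigma2.
    A = np.diag(1.0 / w) + Phi_train.T @ Phi_train / sigma2
    Sigma = np.linalg.inv(A)
    mu = Sigma @ (Phi_train.T @ y) / sigma2
    mean = Phi_test @ mu                                                  # predictive mean
    var = sigma2 + np.einsum('tm,mk,tk->t', Phi_test, Sigma, Phi_test)    # predictive variance
    return mean, var
```

Because only an M x M system is solved, the cost is O(NM^2) rather than the O(N^3) of exact GP regression. A test point whose eigenfunction values are all close to zero receives a predictive variance close to sigma2, which is the behavior that the correction term (7) is designed to fix.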
The first row demonstrates that, by optimizing the hyperparameters, EigenGP achieves predictions very close to those of the full GP while using only 5 basis functions. In contrast, when the basis points are set to the cluster centers found by K-means, EigenGP yields predictions significantly different from those of the full GP and fails to capture the data trend, in particular for x in (3, 5). The second row of Figure 1 shows that K-means sets the basis points almost evenly spaced over x, and accordingly the five eigenfunctions are smooth global basis functions whose shapes are not directly linked to the function they fit. Evidence maximization, by contrast, sets the basis points unevenly spaced, generating basis functions whose shapes are more localized and adaptive to the function they fit; for example, the eigenfunction represented by the red curve matches the data on the right well.

4 Learning Hyperparameters

In this section we describe how we optimize all the hyperparameters, denoted by θ, which include B in the covariance function (2) and all the kernel hyperparameters (e.g., a and η). To optimize θ, we maximize the marginal likelihood (i.e., evidence) based on a conjugate Newton method (we use the code from .../code/matlab/doc). We explore two strategies for evidence maximization. The first one is sequential optimization, which first fixes w while updating all the other hyperparameters, and then optimizes w while fixing the other hyperparameters. The second strategy is to optimize all the hyperparameters jointly. Here we skip the details of the more complicated joint optimization and describe the key gradient formulas for sequential optimization.

The key computation in our optimization is the log marginal likelihood and its gradient:

$\ln p(y \mid \theta) = -\tfrac{1}{2} \ln|C_N| - \tfrac{1}{2} y^T C_N^{-1} y - \tfrac{N}{2} \ln(2\pi)$,   (8)

$d \ln p(y \mid \theta) = -\tfrac{1}{2}\left[\mathrm{tr}(C_N^{-1}\, dC_N) - \mathrm{tr}(C_N^{-1} y y^T C_N^{-1}\, dC_N)\right]$,   (9)

where C_N = K + σ^2 I, K = Φ diag(w) Φ^T, and Φ = {φ_m(x_n)} is an N by M matrix. Because the rank of K is M, we can compute ln|C_N| and C_N^{-1} efficiently at a cost of O(NM^2) via the matrix inversion and determinant lemmas. Even with the use of the matrix inversion lemma for the low-rank computation, a naive calculation would be very costly. We apply identities from computational linear algebra [Minka, 2000; de Leeuw, 2007] to simplify the needed computation dramatically.

To compute the derivative with respect to B, we first notice that, when w is fixed, we have

$C_N = K + \sigma^2 I = K_{XB} K_{BB}^{-1} K_{BX} + \sigma^2 I$,   (10)

where K_XB is the cross-covariance matrix between the training data X and the inducing inputs B, and K_BB is the covariance matrix on B. For the ARD squared exponential kernel, utilizing the identities tr(P^T Q) = vec(P)^T vec(Q) and vec(P ∘ Q) = diag(vec(P)) vec(Q), where vec(·) vectorizes a matrix into a column vector and ∘ represents the Hadamard product, we can derive the derivative of the first trace term in (9):

$\mathrm{tr}(C_N^{-1}\, dC_N) = 4\,\mathbf{1}^T\!\left\{\left[R X^T \mathrm{diag}(\eta) - ((R\mathbf{1})\mathbf{1}^T) \circ (B^T \mathrm{diag}(\eta)) - S B^T \mathrm{diag}(\eta) + ((S\mathbf{1})\mathbf{1}^T) \circ (B^T \mathrm{diag}(\eta))\right] \circ dB^T\right\}\mathbf{1}$,   (11)

where 1 is a column vector of all ones, the columns of X and B hold the training inputs and the inducing inputs, and

$R = (K_{BB}^{-1} K_{BX} C_N^{-1}) \circ K_{BX}$,   (12)

$S = (K_{BB}^{-1} K_{BX} C_N^{-1} K_{XB} K_{BB}^{-1}) \circ K_{BB}$.   (13)

Note that we can compute K_BX C_N^{-1} efficiently via low-rank updates. Also, (R1)1^T and (S1)1^T can be implemented efficiently by first summing over the columns of R and S and then copying the resulting vectors multiple times, without any multiplication operation. To obtain tr(C_N^{-1} y y^T C_N^{-1} dC_N), we simply replace C_N^{-1} in (11), (12) and (13) by C_N^{-1} y y^T C_N^{-1}. Using similar derivations, we can obtain the derivatives with respect to the lengthscales η, a and σ, respectively.

To compute the derivative with respect to w, we can use the formula tr(P diag(w) Q) = 1^T (Q^T ∘ P) w to obtain the two trace terms in (9) as follows:

$\mathrm{tr}(C_N^{-1}\, dC_N) = \mathbf{1}^T\left(\Phi \circ (C_N^{-1} \Phi)\right) dw$,   (14)

$\mathrm{tr}(C_N^{-1} y y^T C_N^{-1}\, dC_N) = \mathbf{1}^T\left(\Phi \circ (C_N^{-1} y y^T C_N^{-1} \Phi)\right) dw$.   (15)

For either sequential or joint optimization, the overall computational complexity is O(max(M, D) N), where D is the data dimension.
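As an illustration of the O(NM^2) evidence computation described above, the following sketch evaluates (8) using the matrix inversion and determinant lemmas. It is our own NumPy illustration, not the authors' MATLAB code.

```python
import numpy as np

def eigengp_log_evidence(Phi, y, w, sigma2):
    """Log marginal likelihood ln p(y | theta), Eq. (8), for C_N = Phi diag(w) Phi^T + sigma2 I.

    Phi: (N, M) eigenfunction matrix; y: (N,) targets; w: (M,) prior variances; sigma2: noise variance.
    """
    N, M = Phi.shape
    A = np.diag(1.0 / w) + Phi.T @ Phi / sigma2    # M x M matrix from the Woodbury identity
    L = np.linalg.cholesky(A)
    v = np.linalg.solve(L, Phi.T @ y / sigma2)
    quad = y @ y / sigma2 - v @ v                  # y^T C_N^{-1} y without forming C_N
    # Matrix determinant lemma: ln|C_N| = ln|A| + sum(ln w) + N ln(sigma2).
    logdet = 2.0 * np.sum(np.log(np.diag(L))) + np.sum(np.log(w)) + N * np.log(sigma2)
    return -0.5 * (logdet + quad + N * np.log(2.0 * np.pi))
```

The gradients (11)-(15) can be accumulated from the same low-rank quantities, so no N x N matrix is ever formed.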
5 Related Work

Our work is closely related to the seminal work of Williams and Seeger [2001], but they differ in multiple aspects. First, we define a valid probabilistic model based on an eigendecomposition of the GP prior. By contrast, the previous approach [Williams and Seeger, 2001] aims at a low-rank approximation to the finite covariance/kernel matrix used in GP training from a numerical approximation perspective, and its predictive distribution is not well-formed in a probabilistic framework (e.g., it may give a negative variance of the predictive distribution). Second, while the Nyström method simply uses the first few eigenvectors, we maximize the model marginal likelihood to adjust their weights in the covariance function. Third, by exploiting the clustering property of the eigenfunctions of the Gaussian kernel, our approach can conduct semi-supervised learning, while the previous one cannot; the semi-supervised learning capability of EigenGP is investigated in another paper of ours. Fourth, the Nyström method lacks a principled way to learn the model hyperparameters, including the kernel width and the basis points, while EigenGP does not. Our work is also related to methods that use kernel principal component analysis (PCA) to speed up kernel machines [Hoegaerts et al., 2005]. However, for these methods it can be difficult, if not impossible, to learn important hyperparameters, including a kernel width for each dimension and inducing inputs (not a subset of the training samples). By contrast, EigenGP learns all these hyperparameters from data based on gradients of the model marginal likelihood.

6 Experimental Results

In this section, we compare EigenGP and alternative methods on synthetic and real benchmark datasets. The alternative methods include the sparse GP methods SSGP and FITC and the Nyström method, as well as RVMs. We implemented the Nyström method ourselves and downloaded the software implementations of the other methods from their authors' websites. For RVMs, we used the fast fixed-point algorithm [Faul and Tipping, 2001]. We used the ARD kernel for all the methods except RVMs (since they do not estimate the lengthscales in this kernel) and optimized all the hyperparameters via evidence maximization. For RVMs, we chose the squared exponential kernel with the same lengthscale for all the dimensions and applied cross-validation on the training data to select the lengthscale. On the large real data, we used the values of η, a, and σ learned from the full GP on a subset of the training data to initialize all the methods except RVMs. For the remaining configurations, we used the default settings of the downloaded software packages. For our own model, we denote the versions with sequential and joint optimization as EigenGP and EigenGP*, respectively.

To evaluate the test performance of each method, we measure the Normalized Mean Square Error (NMSE) and the Mean Negative Log Probability (MNLP), defined as

$\mathrm{NMSE} = \sum_i (y_i - \mu_i)^2 \big/ \sum_i (\mu_i - \bar{y})^2$,   (16)

$\mathrm{MNLP} = \frac{1}{2N} \sum_i \left[(y_i - \mu_i)^2 / \sigma_i^2 + \ln \sigma_i^2 + \ln 2\pi\right]$,   (17)

where y_i, μ_i and σ_i^2 are the response value, the predictive mean and the predictive variance for the i-th test point, respectively, and ȳ is the average response value of the training data.
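The two metrics in (16)-(17) are straightforward to compute; a small NumPy version (naming ours) is:

```python
import numpy as np

def nmse(y, mu, y_train_mean):
    """Normalized mean square error, following Eq. (16)."""
    return np.sum((y - mu) ** 2) / np.sum((mu - y_train_mean) ** 2)

def mnlp(y, mu, var):
    """Mean negative log probability, Eq. (17); var holds the predictive variances."""
    return 0.5 * np.mean((y - mu) ** 2 / var + np.log(var) + np.log(2.0 * np.pi))
```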

6.1 Approximation Quality on Synthetic Data

Table 1: NMSE and MNLP on the two synthetic datasets for the Nyström method, Nyström*, SSGP, FITC, EigenGP, and EigenGP* (mean ± standard error over the runs).

Figure 2: Predictions of four sparse GP methods: (a) the Nyström method, (b) SSGP, (c) FITC, and (d) EigenGP. Pink dots represent training data, blue curves are predictive means, red curves are two standard deviations above and below the mean curves, and the black crosses indicate the inducing inputs.

As in Section 3.1, we use the synthetic data from the FITC paper for this comparative study. To let all the methods have the same computational complexity, we set the number of inducing inputs to M = 7. The results are summarized in Figure 2. For the Nyström method, we used the kernel width learned from the full GP and applied K-means to choose the basis locations [Zhang et al., 2008]. Figure 2(a) shows that it does not fit well. Figure 2(b) demonstrates that the prediction of SSGP oscillates outside the range of the training samples, probably because the sinusoidal components are global and span the whole data range (increasing the number of basis functions would improve SSGP's predictive performance, but would also increase the computational cost). As shown by Figure 2(c), FITC fails to capture the turns of the data accurately, while EigenGP, shown in Figure 2(d), can.

Using the full GP predictive mean as the label (we do not have the true y values in the test data), we compute the NMSE and MNLP of all the methods. The average results over repeated runs are reported in Table 1 (dataset 1). We report two versions of the Nyström method: for the first version, the kernel width is learned from the full GP and the basis locations are chosen by K-means as before; for the second version, denoted Nyström*, the hyperparameters are learned by evidence maximization. Note that the evidence maximization algorithm for the Nyström approximation is itself new; we developed it for this comparative analysis. Table 1 shows that both EigenGP and EigenGP* approximate the mean of the full GP model more accurately than the other methods, in particular, several orders of magnitude better than the Nyström method. Furthermore, we add the difference term (7) to the kernel function and denote this version of our algorithm EigenGP+. It gives better predictive variance far from the training data, but its predictive mean is slightly worse than that of the version without this term. Thus, on the other datasets, we only use the versions without this term (EigenGP and EigenGP*) for their simplicity and effectiveness.

We also examine the performance of all these methods at a higher computational complexity, using a larger M. Again, both versions of the Nyström method give poor predictive distributions, and SSGP still produces extra wavy patterns outside the training data. FITC, EigenGP and EigenGP+ give good predictions. Again, EigenGP+ gives better predictive variance far from the training data, but with a predictive mean similar to that of EigenGP. Finally, we compare the RVM with EigenGP* on this dataset: EigenGP* is both faster and more accurate (EigenGP performs similarly).
6.2 Prediction Quality on Nonstationary Data

We then compare all the sparse GP methods on a one-dimensional nonstationary synthetic dataset. The underlying function is f(x) = x sin(x^3), where x ∈ (0, 3), and white Gaussian noise is added to the function values. This function is nonstationary in the sense that its frequency and amplitude increase as x increases from 0. We randomly generated the data multiple times and set the number of basis points (functions) to be the same for all the competing methods.
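For reference, data of this kind can be generated along the following lines. This is only a sketch: the sample size and noise level are placeholders, and the form x*sin(x^3) is inferred from the description of growing frequency and amplitude.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 3.0, size=300))   # sample size chosen for illustration only
f = x * np.sin(x ** 3)                         # nonstationary target: frequency and amplitude grow with x
y = f + 0.1 * rng.normal(size=x.shape)         # additive white Gaussian noise (illustrative level)
```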

Figure 3: Predictions on the nonstationary data: (a) SSGP, (b) FITC, (c) EigenGP. The pink dots correspond to noisy data around the true function f(x) = x sin(x^3), represented by the green curves. The blue and red solid curves correspond to the predictive means and ± two standard deviations around the means. The black crosses near the bottom represent the estimated basis points for FITC and EigenGP.

Using the true function value as the label, we compute the means and standard errors of the NMSE and MNLP, reported in Table 1 (dataset 2). For the Nyström method, marginal likelihood optimization leads to much smaller error than the K-means based approach; however, both fare poorly when compared with the alternative methods. Table 1 also shows that EigenGP and EigenGP* achieve a striking error reduction of several thousand fold compared with Nyström*, and a further substantial error reduction compared with the second best method. We also compared RVMs with EigenGP* on this dataset, averaged over the runs (EigenGP gives results similar to EigenGP*).

We further illustrate the predictive mean and standard deviation on a typical run in Figure 3. As shown in Figure 3(a), the predictive mean of SSGP contains reasonable high-frequency components in the right region but, since SSGP is a stationary GP model, these high-frequency components produce extra wavy patterns in the left region. In addition, the predictive mean on the right is smaller than the true one, probably affected by the small dynamic range of the data on the left. Figure 3(b) shows that the predictive mean of FITC in the right region has lower frequency and smaller amplitude than the true function, perhaps influenced by the low-frequency part on the left. Indeed, because of the low-frequency part, FITC learns a large kernel width η on average over the runs, and this large value affects the quality of the learned basis points (e.g., a lack of basis points in the high-frequency region on the right). By contrast, using the same initial kernel width as FITC, EigenGP learns a suitable kernel width on average and provides good predictions, as shown in Figure 3(c).

6.3 Accuracy vs. Time on Real Data

To evaluate the trade-off between prediction accuracy and computational cost, we use three large real datasets. The first dataset is California Housing [Pace and Barry, 1997]; we randomly split this 8-dimensional dataset into 10,000 training and 10,640 test points. The second dataset is Physicochemical Properties of Protein Tertiary Structures (PPPTS), obtained from Lichman [2013]; we randomly split this 9-dimensional dataset into training and test sets. The third dataset is Pole Telecomm, which was used in Lázaro-Gredilla et al. [2010]; it contains 10,000 training and 5,000 test samples, each of which has 26 features. We set M to five increasing values and fixed the same maximum number of optimization iterations for all methods. The NMSE, MNLP and training time of these methods are shown in Figure 4. In addition, we ran the Nyström method based on marginal likelihood maximization, which is better than using K-means to set the basis points. Again, the Nyström method performed orders of magnitude worse than the other methods on all three datasets across the range of M; its MNLP values are consistently large and are omitted here for simplicity.
For RVMs, we include the cross-validation in the training time because choosing an appropriate kernel width is crucial for the RVM. Since the RVM learns the number of basis functions automatically from the data, it shows a single result for each dataset in Figure 4. EigenGP achieves the lowest prediction error while using less training time. Compared with EigenGP, which is based on sequential optimization, EigenGP* achieves similar errors but takes longer, because the joint optimization is more expensive.

7 Conclusions

In this paper we have presented a simple yet effective sparse Gaussian process method, EigenGP, and applied it to regression. Despite its similarity to the Nyström method, EigenGP improves the prediction quality by several orders of magnitude. EigenGP can easily be extended to conduct online learning, either by using stochastic gradient descent to update the weights of the eigenfunctions or by applying the online variational Bayes idea for GPs [Hensman et al., 2013].

Acknowledgments

This work was supported by NSF ECCS-93, NSF CAREER award IIS-593, and the Center for Science of Information, an NSF Science and Technology Center, under grant agreement CCF.

Figure 4: NMSE and MNLP vs. training time on (a) California Housing, (b) PPPTS, and (c) Pole Telecomm. Each method (except RVMs) has five results, one for each of the five values of M. In (c), the fifth result of one of the methods lies outside the plotted range.

References

[Cressie and Johannesson, 2008] Noel Cressie and Gardar Johannesson. Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):209-226, February 2008.

[Csató and Opper, 2002] Lehel Csató and Manfred Opper. Sparse online Gaussian processes. Neural Computation, 14:641-668, March 2002.

[de Leeuw, 2007] Jan de Leeuw. Derivatives of generalized eigen systems with applications. In Department of Statistics Papers. Department of Statistics, UCLA, 2007.

[Faul and Tipping, 2001] Anita C. Faul and Michael E. Tipping. Analysis of sparse Bayesian learning. In Advances in Neural Information Processing Systems 14. MIT Press, 2001.

[Hensman et al., 2013] James Hensman, Nicolò Fusi, and Neil D. Lawrence. Gaussian processes for big data. In the 29th Conference on Uncertainty in Artificial Intelligence (UAI 2013), pages 282-290, 2013.

[Higdon, 2002] Dave Higdon. Space and space-time modeling using process convolutions. In Clive W. Anderson, Vic Barnett, Philip C. Chatwin, and Abdel H. El-Shaarawi, editors, Quantitative Methods for Current Environmental Issues. Springer Verlag, 2002.

[Hoegaerts et al., 2005] Luc Hoegaerts, Johan A. K. Suykens, Joos Vandewalle, and Bart De Moor. Subset based least squares subspace regression in RKHS. Neurocomputing, 63:293-323, January 2005.

[Lázaro-Gredilla et al., 2010] Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl E. Rasmussen, and Aníbal R. Figueiras-Vidal. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research, 11:1865-1881, 2010.

[Lichman, 2013] Moshe Lichman. UCI Machine Learning Repository, 2013.

[MacKay, 1992] David J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415-447, 1992.

[Marzouk and Najm, 2009] Youssef M. Marzouk and Habib N. Najm. Dimensionality reduction and polynomial chaos acceleration of Bayesian inference in inverse problems. Journal of Computational Physics, 228(6):1862-1902, 2009.

[Minka, 2000] Thomas P. Minka. Old and new matrix algebra useful for statistics. Technical report, MIT Media Lab, 2000.

[Pace and Barry, 1997] R. Kelley Pace and Ronald Barry. Sparse spatial autoregressions. Statistics and Probability Letters, 33(3):291-297, 1997.

[Paciorek and Schervish, 2004] Christopher J. Paciorek and Mark J. Schervish. Nonstationary covariance functions for Gaussian process regression. In Advances in Neural Information Processing Systems 16. MIT Press, 2004.

[Qi et al., 2010] Yuan Qi, Ahmed H. Abdel-Gawad, and Thomas P. Minka. Sparse-posterior Gaussian processes for general likelihoods. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, 2010.

[Quiñonero-Candela and Rasmussen, 2005] Joaquin Quiñonero-Candela and Carl E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939-1959, 2005.

[Snelson and Ghahramani, 2006] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18. MIT Press, 2006.

[Tipping, 2000] Michael E. Tipping. The relevance vector machine. In Advances in Neural Information Processing Systems 12. MIT Press, 2000.

[Titsias, 2009] Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In The 12th International Conference on Artificial Intelligence and Statistics, pages 567-574, 2009.

[Williams and Barber, 1998] Christopher K. I. Williams and David Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342-1351, 1998.

[Williams and Seeger, 2001] Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13. MIT Press, 2001.

[Yan and Qi, 2010] Feng Yan and Yuan Qi. Sparse Gaussian process regression via l1 penalization. In Proceedings of the 27th International Conference on Machine Learning, pages 1183-1190, 2010.

[Zhang et al., 2008] Kai Zhang, Ivor W. Tsang, and James T. Kwok. Improved Nyström low-rank approximation and error analysis. In Proceedings of the 25th International Conference on Machine Learning. ACM, 2008.


More information

GWAS V: Gaussian processes

GWAS V: Gaussian processes GWAS V: Gaussian processes Dr. Oliver Stegle Christoh Lippert Prof. Dr. Karsten Borgwardt Max-Planck-Institutes Tübingen, Germany Tübingen Summer 2011 Oliver Stegle GWAS V: Gaussian processes Summer 2011

More information

Advanced Introduction to Machine Learning CMU-10715

Advanced Introduction to Machine Learning CMU-10715 Advanced Introduction to Machine Learning CMU-10715 Gaussian Processes Barnabás Póczos http://www.gaussianprocess.org/ 2 Some of these slides in the intro are taken from D. Lizotte, R. Parr, C. Guesterin

More information

Bayesian Quadrature: Model-based Approximate Integration. David Duvenaud University of Cambridge

Bayesian Quadrature: Model-based Approximate Integration. David Duvenaud University of Cambridge Bayesian Quadrature: Model-based Approimate Integration David Duvenaud University of Cambridge The Quadrature Problem ˆ We want to estimate an integral Z = f ()p()d ˆ Most computational problems in inference

More information

BAYESIAN CLASSIFICATION OF HIGH DIMENSIONAL DATA WITH GAUSSIAN PROCESS USING DIFFERENT KERNELS

BAYESIAN CLASSIFICATION OF HIGH DIMENSIONAL DATA WITH GAUSSIAN PROCESS USING DIFFERENT KERNELS BAYESIAN CLASSIFICATION OF HIGH DIMENSIONAL DATA WITH GAUSSIAN PROCESS USING DIFFERENT KERNELS Oloyede I. Department of Statistics, University of Ilorin, Ilorin, Nigeria Corresponding Author: Oloyede I.,

More information

Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1 Bayesian paradigm Consistent use of probability theory

More information

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, 2017 Spis treści Website Acknowledgments Notation xiii xv xix 1 Introduction 1 1.1 Who Should Read This Book?

More information

CS 7140: Advanced Machine Learning

CS 7140: Advanced Machine Learning Instructor CS 714: Advanced Machine Learning Lecture 3: Gaussian Processes (17 Jan, 218) Jan-Willem van de Meent (j.vandemeent@northeastern.edu) Scribes Mo Han (han.m@husky.neu.edu) Guillem Reus Muns (reusmuns.g@husky.neu.edu)

More information

SCUOLA DI SPECIALIZZAZIONE IN FISICA MEDICA. Sistemi di Elaborazione dell Informazione. Regressione. Ruggero Donida Labati

SCUOLA DI SPECIALIZZAZIONE IN FISICA MEDICA. Sistemi di Elaborazione dell Informazione. Regressione. Ruggero Donida Labati SCUOLA DI SPECIALIZZAZIONE IN FISICA MEDICA Sistemi di Elaborazione dell Informazione Regressione Ruggero Donida Labati Dipartimento di Informatica via Bramante 65, 26013 Crema (CR), Italy http://homes.di.unimi.it/donida

More information

Statistical Techniques in Robotics (16-831, F12) Lecture#21 (Monday November 12) Gaussian Processes

Statistical Techniques in Robotics (16-831, F12) Lecture#21 (Monday November 12) Gaussian Processes Statistical Techniques in Robotics (16-831, F12) Lecture#21 (Monday November 12) Gaussian Processes Lecturer: Drew Bagnell Scribe: Venkatraman Narayanan 1, M. Koval and P. Parashar 1 Applications of Gaussian

More information

Part 1: Expectation Propagation

Part 1: Expectation Propagation Chalmers Machine Learning Summer School Approximate message passing and biomedicine Part 1: Expectation Propagation Tom Heskes Machine Learning Group, Institute for Computing and Information Sciences Radboud

More information

Clustering VS Classification

Clustering VS Classification MCQ Clustering VS Classification 1. What is the relation between the distance between clusters and the corresponding class discriminability? a. proportional b. inversely-proportional c. no-relation Ans:

More information

Linear Models for Regression. Sargur Srihari

Linear Models for Regression. Sargur Srihari Linear Models for Regression Sargur srihari@cedar.buffalo.edu 1 Topics in Linear Regression What is regression? Polynomial Curve Fitting with Scalar input Linear Basis Function Models Maximum Likelihood

More information

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Introduction to Gaussian Process

Introduction to Gaussian Process Introduction to Gaussian Process CS 778 Chris Tensmeyer CS 478 INTRODUCTION 1 What Topic? Machine Learning Regression Bayesian ML Bayesian Regression Bayesian Non-parametric Gaussian Process (GP) GP Regression

More information

Mathematics 309 Conic sections and their applicationsn. Chapter 2. Quadric figures. ai,j x i x j + b i x i + c =0. 1. Coordinate changes

Mathematics 309 Conic sections and their applicationsn. Chapter 2. Quadric figures. ai,j x i x j + b i x i + c =0. 1. Coordinate changes Mathematics 309 Conic sections and their applicationsn Chapter 2. Quadric figures In this chapter want to outline quickl how to decide what figure associated in 2D and 3D to quadratic equations look like.

More information

Te Whare Wananga o te Upoko o te Ika a Maui. Computer Science

Te Whare Wananga o te Upoko o te Ika a Maui. Computer Science VICTORIA UNIVERSITY OF WELLINGTON Te Whare Wananga o te Upoko o te Ika a Maui School of Mathematical and Computing Sciences Computer Science Multiple Output Gaussian Process Regression Phillip Boyle and

More information

Applications of Proper Orthogonal Decomposition for Inviscid Transonic Aerodynamics

Applications of Proper Orthogonal Decomposition for Inviscid Transonic Aerodynamics Applications of Proper Orthogonal Decomposition for Inviscid Transonic Aerodnamics Bui-Thanh Tan, Karen Willco and Murali Damodaran Abstract Two etensions to the proper orthogonal decomposition (POD) technique

More information

Gaussian Process Regression

Gaussian Process Regression Gaussian Process Regression 4F1 Pattern Recognition, 21 Carl Edward Rasmussen Department of Engineering, University of Cambridge November 11th - 16th, 21 Rasmussen (Engineering, Cambridge) Gaussian Process

More information

Machine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling

Machine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling Machine Learning B. Unsupervised Learning B.2 Dimensionality Reduction Lars Schmidt-Thieme, Nicolas Schilling Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University

More information