EigenGP: Gaussian Process Models with Adaptive Eigenfunctions
Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015)

EigenGP: Gaussian Process Models with Adaptive Eigenfunctions

Hao Peng, Department of Computer Science, Purdue University, West Lafayette, IN 47907, USA, pengh@purdue.edu
Yuan Qi, Departments of Computer Science and Statistics, Purdue University, West Lafayette, IN 47907, USA, alanqi@cs.purdue.edu

Abstract

Gaussian processes (GPs) provide a nonparametric representation of functions. However, classical GP inference suffers from high computational cost for big data. In this paper, we propose a new Bayesian approach, EigenGP, that learns both basis dictionary elements (eigenfunctions of a GP prior) and prior precisions in a sparse finite model. It is well known that, among all orthogonal basis functions, eigenfunctions can provide the most compact representation. Unlike other sparse Bayesian finite models where the basis function has a fixed form, our eigenfunctions live in a reproducing kernel Hilbert space as a finite linear combination of kernel functions. We learn the dictionary elements (eigenfunctions) and the prior precisions over these elements, as well as all the other hyperparameters, from data by maximizing the model marginal likelihood. We explore computational linear algebra to simplify the gradient computation significantly. Our experimental results demonstrate improved predictive performance of EigenGP over alternative sparse GP methods as well as relevance vector machines.

1 Introduction

Gaussian processes (GPs) are powerful nonparametric Bayesian models with numerous applications in machine learning and statistics. GP inference, however, is costly. Training the exact GP regression model with N samples is expensive: it takes an O(N^2) space cost and an O(N^3) time cost.
To address this issue, a variety of approximate sparse GP inference approaches have been developed [Williams and Seeger, 2001; Csató and Opper, 2002; Snelson and Ghahramani, 2006; Lázaro-Gredilla et al., 2010; Williams and Barber, 1998; Titsias, 2009; Qi et al., 2010; Higdon, 2002; Cressie and Johannesson, 2008], for example, using the Nyström method to approximate covariance matrices [Williams and Seeger, 2001], optimizing a variational bound on the marginal likelihood [Titsias, 2009], or grounding the GP on a small set of (blurred) basis points [Snelson and Ghahramani, 2006; Qi et al., 2010]. An elegant unifying view of the various sparse GP regression models is given by Quiñonero-Candela and Rasmussen [2005]. Among all sparse GP regression methods, a state-of-the-art approach is to represent a function as a sparse finite linear combination of pairs of trigonometric basis functions, a sine and a cosine for each spectral point; this approach is thus called sparse spectrum Gaussian process (SSGP) [Lázaro-Gredilla et al., 2010]. SSGP integrates out both the weights and the phases of the trigonometric functions and learns all the hyperparameters of the model (frequencies and amplitudes) by maximizing the marginal likelihood. Using global trigonometric functions as basis functions, SSGP has the capability of approximating any stationary Gaussian process model and has been shown to outperform alternative sparse GP methods, including the fully independent training conditional (FITC) approximation [Snelson and Ghahramani, 2006], on benchmark datasets. Another popular sparse Bayesian finite linear model is the relevance vector machine (RVM) [Tipping, 2000; Faul and Tipping, 2002]. It uses kernel expansions over the training samples as basis functions and selects the basis functions by automatic relevance determination [MacKay, 1992; Faul and Tipping, 2002]. In this paper, we propose a new sparse Bayesian approach, EigenGP, that learns both functional dictionary elements (eigenfunctions) and prior precisions in a finite linear model representation of a GP.
It is well known that, among all orthogonal basis functions, eigenfunctions provide the most compact representation. Unlike SSGP or RVMs, where the basis function has a fixed form, our eigenfunctions live in a reproducing kernel Hilbert space (RKHS) as a finite linear combination of kernel functions with their weights learned from data. We further marginalize out the weights over the eigenfunctions and estimate all the hyperparameters, including the basis points for the eigenfunctions, the lengthscales, and the precision of the weight prior, by maximizing the model marginal likelihood (also known as evidence). To do so, we explore computational linear algebra and greatly simplify the gradient computation for optimization (thus our optimization method is totally different from RVM optimization methods). As a result of this optimization, our eigenfunctions are data dependent and make EigenGP capable of accurately modeling nonstationary data. Furthermore, by adding an additional kernel term to our model, we can turn the finite model into an infinite model to better model the prediction uncertainty; that
is, it can give a nonzero prediction variance when a test sample is far from the training samples. EigenGP is computationally efficient. It takes O(NM) space and O(NM^2) time for training on N samples with M basis functions, which is the same as SSGP and more efficient than RVMs (as RVMs learn weights over N, not M, basis functions). Similar to SSGP and FITC, EigenGP focuses on predictive accuracy at low computational cost, rather than on faithfully converging towards the full GP as the number of basis functions grows. (For the latter case, please see the approach of Yan and Qi [2010] that explicitly minimizes the KL divergence between the exact and approximate GP posterior processes.) The rest of the paper is organized as follows. Section 2 describes the background of GPs. Section 3 presents the EigenGP model and an illustrative example. Section 4 outlines the marginal likelihood maximization for learning the dictionary elements and the other hyperparameters. In Section 5, we discuss related work. Section 6 shows regression results on multiple benchmark regression datasets, demonstrating improved performance of EigenGP over the Nyström method [Williams and Seeger, 2001], RVM, SSGP, and FITC.

2 Background of Gaussian Processes

We denote N independent and identically distributed samples as D = {(x_1, y_1), ..., (x_N, y_N)}, where x_i is a D-dimensional input (i.e., explanatory variables) and y_i is a scalar output (i.e., a response), which we assume is the noisy realization of a latent function f at x_i. A Gaussian process places a prior distribution over the latent function f. Its projection f at {x_i}_{i=1}^N defines a joint Gaussian distribution p(f) = N(f | m_0, K), where, without any prior preference, the mean m_0 is set to 0, and the covariance function k(x_i, x_j) = K(x_i, x_j) encodes the prior notion of smoothness. A popular choice is the anisotropic squared exponential covariance function: k(x, x') = a^2 exp(-(x - x')^T diag(η)(x - x') / 2), where the hyperparameters include the signal variance a^2 and the lengthscale parameters η = {η_d}_{d=1}^D, controlling how fast the covariance decays with the distance between inputs.
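The anisotropic (ARD) squared exponential covariance above is straightforward to compute for whole input matrices at once. The following is a minimal sketch, not from the paper; the function name and signature are our own, and η is used exactly as in the formula above (per-dimension scale factors, with η_d = 0 switching a dimension off):

```python
import numpy as np

def ard_se_kernel(X1, X2, a2=1.0, eta=None):
    """ARD squared exponential: k(x, x') = a2 * exp(-0.5 (x-x')^T diag(eta) (x-x')).

    X1: (n1, D) array, X2: (n2, D) array; returns the (n1, n2) covariance matrix.
    """
    eta = np.ones(X1.shape[1]) if eta is None else np.asarray(eta, dtype=float)
    # Scale each dimension by sqrt(eta_d); the weighted squared distance then
    # becomes an ordinary squared Euclidean distance in the scaled space.
    Z1 = X1 * np.sqrt(eta)
    Z2 = X2 * np.sqrt(eta)
    sq = np.sum(Z1**2, 1)[:, None] + np.sum(Z2**2, 1)[None, :] - 2.0 * Z1 @ Z2.T
    return a2 * np.exp(-0.5 * np.maximum(sq, 0.0))  # clip tiny negative round-off
```

Setting η_d = 0 makes the d-th input dimension irrelevant to the covariance value, which is exactly the ARD pruning effect discussed next.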
Using this covariance function, we can prune input dimensions by shrinking the corresponding lengthscale parameters based on the data (when η_d = 0, the d-th dimension becomes totally irrelevant to the covariance function value). This pruning is known as Automatic Relevance Determination (ARD), and this covariance is therefore also called the ARD squared exponential. Note that the covariance function value remains the same as long as (x - x') is the same, regardless of where x and x' are. This leads to a stationary GP model. For nonstationary data, however, a stationary GP model is a misfit. Although nonstationary GP models have been developed and applied to real-world applications, they are often limited to low-dimensional problems, such as applications in spatial statistics [Paciorek and Schervish, 2004]. Constructing general nonstationary GP models remains a challenging task. For regression, we use a Gaussian likelihood function p(y_i | f) = N(y_i | f(x_i), σ^2), where σ^2 is the variance of the observation noise. Given the Gaussian process prior over f and the data likelihood, the exact posterior process is p(f | D) ∝ GP(f | 0, k) Π_{i=1}^N p(y_i | f). Although the posterior process for GP regression has an analytical form, we need to store and invert an N by N matrix, which has computational complexity O(N^3), rendering GPs infeasible for big data analytics.

3 Model of EigenGP

To enable fast inference and obtain a nonstationary covariance function, our new model EigenGP projects the GP prior into an eigensubspace. Specifically, we set the latent function

f(x) = Σ_{j=1}^M α_j φ_j(x)    (1)

where M ≪ N and {φ_j(x)} are eigenfunctions of the GP prior. We assign a Gaussian prior over α = [α_1, ..., α_M], α ~ N(0, diag(w)), so that f follows a GP prior with zero mean and the following covariance function

k̃(x, x') = Σ_{j=1}^M w_j φ_j(x) φ_j(x').    (2)

To compute the eigenfunctions {φ_j(x)}, we can use the Galerkin projection to approximate them by Hermite polynomials [Marzouk and Najm, 2009].
For high-dimensional problems, however, this approach requires a tensor product of univariate Hermite polynomials that dramatically increases the number of parameters. To avoid this problem, we apply the Nyström method [Williams and Seeger, 2001], which allows us to obtain an approximation to the eigenfunctions in a high-dimensional space efficiently. Specifically, given inducing inputs (i.e., basis points) B = [b_1, ..., b_M], we replace

∫ k(x, x') φ_j(x) p(x) dx = λ_j φ_j(x')    (3)

by its Monte Carlo approximation

(1/M) Σ_{i=1}^M k(x, b_i) φ_j(b_i) ≈ λ_j φ_j(x).    (4)

Then, by evaluating this equation at B, so that we can estimate the values of φ_j(b_i) and λ_j, we obtain the j-th eigenfunction φ_j(x) as follows:

φ_j(x) = (√M / λ_j^(M)) k(x) u_j^(M) = k(x) u_j    (5)

where k(x) = [k(x, b_1), ..., k(x, b_M)], λ_j^(M) and u_j^(M) are the j-th eigenvalue and eigenvector of the covariance function evaluated at B, and u_j = √M u_j^(M) / λ_j^(M). Note that, for simplicity, we have chosen the number of inducing inputs to be the same as the number of eigenfunctions; in practice, we can use more inducing inputs while computing only the top M eigenvectors. As shown in (5), our eigenfunction lives in an RKHS as a linear combination of the kernel functions evaluated at B with weights u_j.
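The Nyström construction in (5) amounts to one M × M eigendecomposition followed by kernel evaluations. Below is a hedged sketch of that recipe (our own function names, with an RBF kernel standing in for the paper's ARD kernel); it uses the convention reconstructed above, where λ_j^(M) and u_j^(M) come from the Gram matrix on B:

```python
import numpy as np

def rbf(X1, X2, ell=1.0):
    """Plain RBF kernel, used here as a stand-in covariance function."""
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / ell**2)

def nystrom_eigenfunctions(B, kernel):
    """Return phi(X): the M Nystrom eigenfunctions of eq. (5) evaluated at rows of X."""
    M = B.shape[0]
    lam, U = np.linalg.eigh(kernel(B, B))   # eigh returns ascending eigenvalues
    lam, U = lam[::-1], U[:, ::-1]          # reorder: column j <-> j-th largest eigenvalue
    U_scaled = U * (np.sqrt(M) / lam)       # columns are u_j = sqrt(M) u_j^(M) / lambda_j^(M)
    return lambda X: kernel(X, B) @ U_scaled  # phi_j(x) = k(x) u_j
```

With this scaling, the eigenfunctions are orthonormal with respect to the empirical measure on B, i.e., (1/M) Σ_i φ_j(b_i) φ_k(b_i) = δ_jk, which is a convenient sanity check on any implementation.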
Inserting (5) into (1), we obtain

f(x) = Σ_{j=1}^M α_j Σ_{i=1}^M u_ij k(x, b_i).    (6)

This equation reveals a two-layer structure of EigenGP. The first layer linearly combines multiple kernel functions to generate each eigenfunction φ_j. The second layer takes these eigenfunctions as the basis functions to generate the function value f. Note that f is a Bayesian linear combination of {φ_j}, where the weights α are integrated out to avoid overfitting. All the model hyperparameters are learned from data. Specifically, for the first layer, to learn the eigenfunctions {φ_j}, we estimate the inducing inputs B and the kernel hyperparameters (such as the lengthscale parameters η for the ARD kernel) by maximizing the model marginal likelihood. For the second layer, we marginalize out α to avoid overfitting and maximize the model marginal likelihood to learn the hyperparameters w of the prior. With the estimated hyperparameters, the prior over f is nonstationary, because its covariance function in (2) varies over different regions of x. This comes as no surprise, since the eigenfunctions are tied to p(x) in (3). This nonstationarity reflects the fact that our model is adaptive to the distribution of the explanatory variables. Note that, to recover the full uncertainty captured by the kernel function k, we can add the following term to the kernel function of EigenGP:

δ(x - x')(k(x, x') - k̃(x, x'))    (7)

where δ(a) = 1 if and only if a = 0. Compared to the original EigenGP model, which has a finite degree of freedom, this modified model has an infinite number of basis functions (assuming k has an infinite number of basis functions, as the ARD kernel does). Thus, this model can accurately model the uncertainty at a test point even when it is far from the training set. We derive the optimization updates of all the hyperparameters for both the original and modified models. But according to our experiments, the modified model does not improve the prediction accuracy over the original (it even reduces the accuracy sometimes).
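Given an N × M matrix Φ of eigenfunction values, the covariance (2) and a prior draw of f are both one-liners. This is a minimal sketch under our own naming, assuming w holds the prior variances of α (consistent with α ~ N(0, diag(w)) above):

```python
import numpy as np

def eigengp_cov(Phi, w):
    """EigenGP prior covariance of eq. (2): K = Phi diag(w) Phi^T,
    where Phi[n, j] = phi_j(x_n). Rank is at most M."""
    return (Phi * w) @ Phi.T

def sample_f(Phi, w, rng):
    """One prior draw of f at the inputs behind Phi: f = Phi alpha,
    alpha ~ N(0, diag(w)), matching eqs. (1)-(2)."""
    alpha = rng.normal(size=w.shape) * np.sqrt(w)
    return Phi @ alpha
```

The low rank of this covariance is what makes the later marginal likelihood computations cheap, at the cost of zero predictive variance far from the span of the basis; the δ-term in (7) is the paper's proposed fix for that.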
Therefore, we will focus on the original model in our presentation for its simplicity. Before we present the details of hyperparameter optimization, let us first look at an illustrative example of its effect, in particular the effect of optimizing the inducing inputs B.

3.1 Illustrative Example

For this example, we consider a toy dataset used by the FITC algorithm [Snelson and Ghahramani, 2006]. It contains 200 one-dimensional training samples. We use 5 basis points (M = 5) and choose the ARD kernel. We then compare the basis functions {φ_j} and the corresponding predictive distributions in two cases. For the first case, we use the kernel width η learned from the full GP model as the kernel width for EigenGP, and apply K-means to set the basis points B as the cluster centers. The idea of using K-means to set the basis points was suggested by Zhang et al. [2008] to minimize an error bound for the Nyström approximation.

[Figure 1: Illustration of the effect of optimizing the basis points (i.e., inducing inputs). Left column: basis points learned from K-means; right column: basis points estimated by evidence maximization. In the first row (predictions), the pink dots represent data points, the blue and red solid curves correspond to the predictive mean of EigenGP and ± two standard deviations around the mean, the black curve corresponds to the predictive mean of the full GP, and the black crosses denote the basis points. In the second row, the curves in various colors represent the five eigenfunctions of EigenGP.]

For the second case, we optimize η, B, and w by maximizing the marginal likelihood of EigenGP. The results are shown in Figure 1. The first row demonstrates that, by optimizing the hyperparameters, EigenGP achieves predictions very close to those of the full GP using only 5 basis functions. In contrast, when the basis points are set to the cluster centers by K-means, EigenGP gives predictions significantly different from those of the full GP and fails to capture the data trend, in particular for x in (3, 5).
The second row of Figure 1 shows that K-means sets the basis points almost evenly spaced on x, and accordingly the five eigenfunctions are smooth global basis functions whose shapes are not directly linked to the function they fit. Evidence maximization, by contrast, sets the basis points unevenly spaced, generating basis functions whose shapes are more localized and adaptive to the function they fit; for example, the eigenfunction represented by the red curve matches the data on the right well.

4 Learning Hyperparameters

In this section we describe how we optimize all the hyperparameters, denoted by θ, which include B and w in the covariance function (2), and all the kernel hyperparameters (e.g., a^2 and η). To optimize θ, we maximize the marginal likelihood (i.e., evidence) based on a conjugate Newton method. We explore two strategies for evidence maximization. The first is sequential optimization, which first fixes w while updating all the other hyperparameters, and then optimizes w while fixing the other hyperparameters. The second strategy is to optimize
all the hyperparameters jointly. Here we skip the details of the more complicated joint optimization and describe the key gradient formulas for sequential optimization. The key computation in our optimization is the log marginal likelihood and its gradient:

ln p(y | θ) = -(1/2) ln |C| - (1/2) y^T C^{-1} y - (N/2) ln(2π),    (8)

d ln p(y | θ) = -(1/2) [tr(C^{-1} dC) - tr(C^{-1} y y^T C^{-1} dC)]    (9)

where C = K̃ + σ^2 I, K̃ = Φ diag(w) Φ^T, and Φ = {φ_m(x_n)} is an N by M matrix. Because the rank of K̃ is M ≪ N, we can compute ln |C| and C^{-1} efficiently at a cost of O(NM^2) via the matrix inversion and determinant lemmas. Even with the use of the matrix inversion lemma for the low-rank computation, a naive gradient calculation would be very costly. We apply identities from computational linear algebra [Minka, 2000; de Leeuw, 2007] to simplify the needed computation dramatically. To compute the derivative with respect to B, we first notice that, when w is fixed, we have

C = K̃ + σ^2 I = K_XB K_BB^{-1} K_BX + σ^2 I    (10)

where K_XB is the cross-covariance matrix between the training data X and the inducing inputs B, and K_BB is the covariance matrix on B. For the ARD squared exponential kernel, utilizing the identities tr(P^T Q) = vec(P)^T vec(Q) and vec(P ∘ Q) = diag(vec(P)) vec(Q), where vec(·) vectorizes a matrix into a column vector and ∘ represents the Hadamard product, we can derive the derivative of the first trace term in (9):

tr(C^{-1} dC) = 2 · 1^T [(R X^T diag(η) - (R 1 1^T) ∘ (B^T diag(η)) - S B^T diag(η) + (S 1 1^T) ∘ (B^T diag(η))) ∘ dB] 1    (11)

where 1 is a column vector of all ones, and

R = (K_BB^{-1} K_BX C^{-1}) ∘ K_BX    (12)
S = (K_BB^{-1} K_BX C^{-1} K_XB K_BB^{-1}) ∘ K_BB    (13)

Note that we can compute K_BX C^{-1} efficiently via low-rank updates. Also, R 1 1^T (and S 1 1^T) can be implemented efficiently by first summing over the columns of R (S) and then copying the result multiple times, without any multiplication operation. To obtain tr(C^{-1} y y^T C^{-1} dC), we simply replace C^{-1} in (12) and (13) by C^{-1} y y^T C^{-1}. Using similar derivations, we can obtain the derivatives with respect to the lengthscale parameters η, a^2, and σ^2, respectively.
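The low-rank evaluation of (8) can be sketched compactly. The following is our own illustration (function name and code organization are assumptions, not the paper's implementation): by the matrix inversion lemma, C^{-1} y needs only the M × M matrix A = diag(w)^{-1} + Φ^T Φ / σ^2, and by the determinant lemma, ln |C| = ln |A| + Σ_j ln w_j + N ln σ^2, giving O(NM^2) total cost without ever forming the N × N matrix C:

```python
import numpy as np

def log_marginal(y, Phi, w, sigma2):
    """ln p(y | theta) of eq. (8) for C = Phi diag(w) Phi^T + sigma2 I,
    computed in O(N M^2) via the matrix inversion and determinant lemmas."""
    N, M = Phi.shape
    A = np.diag(1.0 / w) + Phi.T @ Phi / sigma2        # M x M inner matrix
    L = np.linalg.cholesky(A)                          # A = L L^T
    b = np.linalg.solve(L, Phi.T @ y / sigma2)
    # Woodbury: y^T C^{-1} y = y^T y / sigma2 - (Phi^T y / sigma2)^T A^{-1} (Phi^T y / sigma2)
    yCy = (y @ y) / sigma2 - b @ b
    # Determinant lemma: ln|C| = ln|A| + sum_j ln w_j + N ln sigma2
    logdetC = 2.0 * np.sum(np.log(np.diag(L))) + np.sum(np.log(w)) + N * np.log(sigma2)
    return -0.5 * (logdetC + yCy + N * np.log(2.0 * np.pi))
```

A direct O(N^3) evaluation of (8) on a small problem agrees with this low-rank version to machine precision, which is a convenient unit test for any implementation of this section.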
To compute the derivative with respect to w, we can use the formula tr(P diag(w) Q) = 1^T (Q^T ∘ P) w to obtain the two trace terms in (9) as follows:

tr(C^{-1} dC) = 1^T (Φ ∘ (C^{-1} Φ)) dw    (14)
tr(C^{-1} y y^T C^{-1} dC) = 1^T (Φ ∘ (C^{-1} y y^T C^{-1} Φ)) dw    (15)

For either sequential or joint optimization, the overall computational complexity is O(NM max(M, D)), where D is the data dimension.

5 Related Work

Our work is closely related to the seminal work of Williams and Seeger [2001], but the two differ in multiple aspects. First, we define a valid probabilistic model based on an eigendecomposition of the GP prior. By contrast, the previous approach [Williams and Seeger, 2001] aims at a low-rank approximation to the finite covariance/kernel matrix used in GP training from a numerical approximation perspective, and its predictive distribution is not well-formed in a probabilistic framework (e.g., it may give a negative predictive variance). Second, while the Nyström method simply uses the first few eigenvectors, we maximize the model marginal likelihood to adjust their weights in the covariance function. Third, exploiting the clustering property of the eigenfunctions of the Gaussian kernel, our approach can conduct semi-supervised learning, while the previous one cannot. The semi-supervised learning capability of EigenGP is investigated in another paper of ours. Fourth, the Nyström method lacks a principled way to learn model hyperparameters, including the kernel width and the basis points, while EigenGP does not. Our work is also related to methods that use kernel principal component analysis (PCA) to speed up kernel machines [Hoegaerts et al., 2005]. However, for these methods it can be difficult, if not impossible, to learn important hyperparameters, including a kernel width for each dimension and inducing inputs (not a subset of the training samples). By contrast, EigenGP learns all these hyperparameters from data based on gradients of the model marginal likelihood.
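The trace identity behind (14) is exact linear algebra, so it can be verified numerically in a few lines. This is a sanity-check sketch of our own (all variable names are ours), comparing the direct trace against the Hadamard-product form for a random perturbation dw:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 7, 3
Phi = rng.normal(size=(N, M))
w = np.array([1.5, 0.8, 0.4])
sigma2 = 0.2
C = (Phi * w) @ Phi.T + sigma2 * np.eye(N)
Cinv = np.linalg.inv(C)

dw = 1e-3 * rng.normal(size=M)          # arbitrary small perturbation of w
dC = (Phi * dw) @ Phi.T                 # dC = Phi diag(dw) Phi^T

lhs = np.trace(Cinv @ dC)                            # direct O(N^2 M) trace
rhs = np.ones(N) @ (Phi * (Cinv @ Phi)) @ dw         # 1^T (Phi o (C^{-1} Phi)) dw, eq. (14)
err = abs(lhs - rhs)                                 # exact identity up to floating point
```

The right-hand side touches only N × M arrays once C^{-1} Φ is available, which is the source of the savings claimed for the w-gradient.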
6 Experimental Results

In this section, we compare EigenGP and alternative methods on synthetic and real benchmark datasets. The alternative methods include the sparse GP methods SSGP, FITC, and the Nyström method, as well as RVMs. We implemented the Nyström method ourselves and downloaded software implementations of the other methods from their authors' websites. For RVMs, we used the fast fixed-point algorithm [Faul and Tipping, 2002]. We used the ARD kernel for all the methods except RVMs (since they do not estimate the lengthscales in this kernel) and optimized all the hyperparameters via evidence maximization. For RVMs, we chose the squared exponential kernel with the same lengthscale for all dimensions and applied 10-fold cross-validation on the training data to select the lengthscale. On the large real datasets, we used the values of η, a^2, and σ^2 learned from the full GP on a subset of the training data to initialize all the methods except RVMs. For the remaining configurations, we used the default settings of the downloaded software packages. For our own model, we denote the versions with sequential and joint optimization as EigenGP and EigenGP*, respectively. To evaluate the test performance of each method, we measure the Normalized Mean Square Error (NMSE) and the Mean
Negative Log Probability (MNLP), defined as:

NMSE = Σ_i (y_i - μ_i)^2 / Σ_i (y_i - ȳ)^2    (16)
MNLP = (1/N) Σ_i (1/2) [((y_i - μ_i)/σ_i)^2 + ln σ_i^2 + ln 2π]    (17)

where y_i, μ_i, and σ_i^2 are the response value, the predictive mean, and the predictive variance for the i-th test point, respectively, and ȳ is the average response value of the training data.

6.1 Approximation Quality on Synthetic Data

[Table 1: NMSE and MNLP on the two synthetic datasets for the Nyström method, Nyström*, SSGP, FITC, EigenGP, and EigenGP*.]

[Figure 2: Predictions of four sparse GP methods: (a) Nyström, (b) SSGP, (c) FITC, (d) EigenGP. Pink dots represent training data; blue curves are predictive means; red curves are two standard deviations above and below the mean curves, and the black crosses indicate the inducing inputs.]

As in Section 3.1, we use the synthetic data from the FITC paper for the comparative study. To let all the methods have the same computational complexity, we set the number of inducing inputs M = 7. The results are summarized in Figure 2. For the Nyström method, we used the kernel width learned from the full GP and applied K-means to choose the basis locations [Zhang et al., 2008]. Figure 2(a) shows that it does not fit well. Figure 2(b) demonstrates that the prediction of SSGP oscillates outside the range of the training samples, probably because the sinusoidal components are global and span the whole data range (increasing the number of basis functions would improve SSGP's predictive performance, but increase the computational cost). As shown by Figure 2(c), FITC fails to capture the turns of the data accurately where EigenGP can. Using the full GP predictive mean as the label (we do not have the true function values in the test data), we compute the NMSE and MNLP of all the methods. The average results over repeated runs are reported in Table 1 (dataset 1). We include results from two versions of the Nyström method.
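As a quick reference, the two evaluation metrics used throughout the experiments can be computed as follows. This is a minimal sketch with our own function names, assuming the standard definitions of NMSE (error normalized by the variance around the training mean) and MNLP (average Gaussian negative log density):

```python
import numpy as np

def nmse(y, mu, y_train_mean):
    """Normalized mean square error: sum (y_i - mu_i)^2 / sum (y_i - ybar)^2."""
    return np.sum((y - mu)**2) / np.sum((y - y_train_mean)**2)

def mnlp(y, mu, var):
    """Mean negative log probability under the Gaussian predictive distribution:
    mean of 0.5 * [((y_i - mu_i)^2 / var_i) + ln var_i + ln 2*pi]."""
    return 0.5 * np.mean((y - mu)**2 / var + np.log(var) + np.log(2.0 * np.pi))
```

NMSE only scores the predictive mean, while MNLP also penalizes over- and under-confident predictive variances, which is why the two tables in this section can rank methods differently.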
For the first version, the kernel width is learned from the full GP and the basis locations are chosen by K-means as before; for the second version, denoted Nyström*, the hyperparameters are learned by evidence maximization. Note that this evidence maximization algorithm for the Nyström approximation is itself novel; we developed it for this comparative analysis. Table 1 shows that both EigenGP and EigenGP* approximate the mean of the full GP model more accurately than the other methods, in particular several orders of magnitude better than the Nyström method. Furthermore, we add the difference term (7) to the kernel function and denote this version of our algorithm EigenGP+. It gives a better predictive variance far from the training data, but its predictive mean (and hence its NMSE and MNLP) is slightly worse than that of the version without this term. Thus, on the other datasets, we only use the versions without this term (EigenGP and EigenGP*) for their simplicity and effectiveness. We also examine the performance of all these methods at a higher computational complexity, by increasing M. Again, both versions of the Nyström method give poor predictive distributions, and SSGP still produces extra wavy patterns outside the training data. FITC, EigenGP, and EigenGP+ give good predictions; again, EigenGP+ gives a better predictive variance far from the training data, with a predictive mean similar to EigenGP's. Finally, we compare the RVM with EigenGP* on this dataset: EigenGP* is both faster and more accurate than the RVM, and EigenGP performs similarly.

6.2 Prediction Quality on Nonstationary Data

We then compare all the sparse GP methods on a one-dimensional nonstationary synthetic dataset. The underlying function is f(x) = x sin(x^3), where x is in (0, 3), and the standard deviation of the white noise is 0.5. This function is nonstationary in the sense that its frequency and amplitude increase as x increases from 0.
We randomly generated the data multiple times and set the same number of basis points (functions) for all the competitive methods.

[Figure 3: Predictions on nonstationary data: (a) SSGP, (b) FITC, (c) EigenGP. The pink dots correspond to noisy data around the true function f(x) = x sin(x^3), represented by the green curves. The blue and red solid curves correspond to the predictive means and ± two standard deviations around the means. The black crosses near the bottom represent the estimated basis points for FITC and EigenGP.]

Using the true function value as the label, we compute the means and the standard errors of the NMSE and MNLP, reported in Table 1 (dataset 2). For the Nyström method, marginal likelihood optimization leads to a much smaller error than the K-means-based approach; however, both fare poorly when compared with the alternative methods. Table 1 also shows that EigenGP and EigenGP* achieve a striking several-thousand-fold error reduction compared with Nyström*, and a substantial error reduction compared with the second-best method. RVMs were again slower and less accurate than EigenGP* (EigenGP gives similar results). We further illustrate the predictive mean and standard deviation on a typical run in Figure 3. As shown in Figure 3(a), the predictive mean of SSGP contains reasonable high-frequency components in the right region, but, as a stationary GP model, these high-frequency components produce extra wavy patterns in the left region. In addition, the predictive mean on the right is smaller than the true one, probably affected by the small dynamic range of the data on the left. Figure 3(b) shows that the predictive mean of FITC on the right has a lower frequency and smaller amplitude than the true function, perhaps influenced by the low-frequency part on the left. In fact, because of the low-frequency part, FITC learns a large kernel width η on average over the runs, and this large value degrades the quality of the learned basis points (e.g., a lack of basis points in the high-frequency region on the right).
By contrast, using the same initial kernel width as FITC, EigenGP learns a suitable kernel width on average and provides good predictions, as shown in Figure 3(c).

6.3 Accuracy vs. Time on Real Data

To evaluate the trade-off between prediction accuracy and computational cost, we use three large real datasets. The first is California Housing [Pace and Barry, 1997]; we randomly split the 8-dimensional data into training and test points. The second is Physicochemical Properties of Protein Tertiary Structures (PPPTS), obtained from Lichman [2013]; we randomly split the 9-dimensional data into 40,000 training and 5,730 test points. The third is Pole Telecomm, which was used by Lázaro-Gredilla et al. [2010]; it contains 10,000 training and 5,000 test samples, each with 26 features. We ran all the methods over five increasing values of M and capped the maximum number of optimization iterations. The NMSE, MNLP, and training time of these methods are shown in Figure 4. In addition, we ran the Nyström method based on marginal likelihood maximization, which is better than using K-means to set the basis points. Again, across all five values of M on all three datasets, the Nyström method performed orders of magnitude worse than the other methods in NMSE for comparable training times; its MNLP values are consistently large and are omitted here for simplicity. For RVMs, we include cross-validation in the training time, because choosing an appropriate kernel width is crucial for the RVM. Since the RVM learns the number of basis functions automatically from the data, Figure 4 shows a single result for it on each dataset. EigenGP achieves the lowest prediction error using a shorter time.
Compared with EigenGP, which is based on sequential optimization, EigenGP* achieves similar errors but takes longer, because the joint optimization is more expensive.

7 Conclusions

In this paper we have presented a simple yet effective sparse Gaussian process method, EigenGP, and applied it to regression. Despite its similarity to the Nyström method, EigenGP improves on its prediction quality by several orders of magnitude. EigenGP can be easily extended to conduct online learning, either by using stochastic gradient descent to update the weights of the eigenfunctions or by applying the online variational Bayes idea for GPs [Hensman et al., 2013].

Acknowledgments

This work was supported by NSF ECCS-0941533, NSF CAREER award IIS-1054903, and the Center for Science of Information, an NSF Science and Technology Center, under grant agreement CCF-0939370.
[Figure 4: NMSE and MNLP vs. training time on (a) California Housing, (b) PPPTS, and (c) Pole Telecomm. Each method (except the RVM) has five results, corresponding to the five increasing values of M. In (c), the fifth result of EigenGP* lies outside the plotted range.]

References

[Cressie and Johannesson, 2008] Noel Cressie and Gardar Johannesson. Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):209-226, February 2008.

[Csató and Opper, 2002] Lehel Csató and Manfred Opper. Sparse online Gaussian processes. Neural Computation, 14(3):641-668, March 2002.

[de Leeuw, 2007] Jan de Leeuw. Derivatives of generalized eigen systems with applications. Department of Statistics Papers, UCLA, 2007.

[Faul and Tipping, 2002] Anita C. Faul and Michael E. Tipping. Analysis of sparse Bayesian learning. In Advances in Neural Information Processing Systems 14. MIT Press, 2002.

[Hensman et al., 2013] James Hensman, Nicolò Fusi, and Neil D. Lawrence. Gaussian processes for big data. In the 29th Conference on Uncertainty in Artificial Intelligence (UAI 2013), pages 282-290, 2013.

[Higdon, 2002] Dave Higdon. Space and space-time modeling using process convolutions. In Clive W. Anderson, Vic Barnett, Philip C. Chatwin, and Abdel H. El-Shaarawi, editors, Quantitative Methods for Current Environmental Issues, pages 37-56. Springer Verlag, 2002.

[Hoegaerts et al., 2005] Luc Hoegaerts, Johan A. K. Suykens, Joos Vandewalle, and Bart De Moor. Subset based least squares subspace regression in RKHS. Neurocomputing, 63:293-323, January 2005.

[Lázaro-Gredilla et al., 2010] Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl E. Rasmussen, and Aníbal R. Figueiras-Vidal. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research, 11:1865-1881, 2010.

[Lichman, 2013] Moshe Lichman. UCI machine learning repository, 2013.
[MacKay, 1992] David J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415-447, 1992.

[Marzouk and Najm, 2009] Youssef M. Marzouk and Habib N. Najm. Dimensionality reduction and polynomial chaos acceleration of Bayesian inference in inverse problems. Journal of Computational Physics, 228(6):1862-1902, 2009.

[Minka, 2000] Thomas P. Minka. Old and new matrix algebra useful for statistics. Technical report, MIT Media Lab, 2000.

[Pace and Barry, 1997] R. Kelley Pace and Ronald Barry. Sparse spatial autoregressions. Statistics and Probability Letters, 33(3):291-297, 1997.

[Paciorek and Schervish, 2004] Christopher J. Paciorek and Mark J. Schervish. Nonstationary covariance functions for Gaussian process regression. In Advances in Neural Information Processing Systems 16. MIT Press, 2004.

[Qi et al., 2010] Yuan Qi, Ahmed H. Abdel-Gawad, and Thomas P. Minka. Sparse-posterior Gaussian processes for general likelihoods. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, 2010.

[Quiñonero-Candela and Rasmussen, 2005] Joaquin Quiñonero-Candela and Carl E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939-1959, 2005.

[Snelson and Ghahramani, 2006] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18. MIT Press, 2006.

[Tipping, 2000] Michael E. Tipping. The relevance vector machine. In Advances in Neural Information Processing Systems 12. MIT Press, 2000.

[Titsias, 2009] Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In The 12th International Conference on Artificial Intelligence and Statistics, pages 567-574, 2009.

[Williams and Barber, 1998] Christopher K. I. Williams and David Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342-1351, 1998.

[Williams and Seeger, 2001] Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines.
In Advances in eural Information Processing Sstems 3, volume 3. MIT Press,. [Yan and Qi, ] Feng Yan and Yuan Qi. Sparse Gaussian process regression via l penalization. In Proceedings of 7th International Conference on Machine Learning, pages 83 9,. [Zhang et al., 8] Kai Zhang, Ivor W. Tsang, and James T. Kwok. Improved ström low-rank approimation and error analsis. In Proceedings of the 5th International Conference on Machine Learning, pages ACM,